SUMMARY


This  work  considers  Markov  decision  processes  with  policy 
constraints.  The  selection  of  an  optimal  stationary  policy  for 
such  processes,  in  the  absence  of  policy  constraints,  is  a problem 
which  has  received  a great  deal  of  attention  and  has  been 
satisfactorily  solved.  Relatively  little  attention  has  been  given 
to  the  case  when  policy  constraints  are  present.  Optimal  policy 
sensitivity  analysis  is  also  a subject  in  which  little  has  been 
achieved. We develop, in this paper, a computationally efficient
iterative algorithm for selecting the optimal policy for completely
ergodic, infinite-time horizon Markov decision processes with policy
constraints for both the risk-indifferent and risk-sensitive cases.
The  sensitivity  of  optimal  policies  vis-a-vis  the  constraints  is 
also  analyzed,  and  the  algorithm  is  used  to  quantify  the  analysis. 

An  important  limitation  of  all  previous  analyses  of  Markov 
decision  processes  is  the  implicit  assumption  that  selecting  an 
alternative  in  any  one  state  has  no  effect  on  alternative  selection 
in  any  other  state.  If  that  assumption  does  not  hold,  we  have 
"policy  constraints".  Some  policies  become  "infeasible",  i.e. 
unallowable.  One  method  of  dealing  with  such  a situation  was 
proposed for risk-indifferent Markov decision processes [10]. The
policies  can  be  ordered  and,  after  determining  the  optimal  policy, 
we can go backwards in the ordering, checking each policy for
"feasibility". This method, however, becomes computationally
burdensome after the second-best policy. Moreover, no method of
ordering has been devised for risk-sensitive Markov decision
processes. The present work shows how to treat efficiently "policy
constrained" problems for both risk-indifferent and risk-sensitive
Markov decision processes by proceeding from one feasible policy to
a better one until the optimal feasible policy is obtained. Our
point of departure is reformulating the Markov decision process in
the absence of policy constraints as a constrained maximization
problem. The Lagrange multiplier rule is then applied to decompose
the problem into two iteratively coupled ones. This yields the
existing algorithms and indicates how to develop a new algorithm
to solve "policy constrained problems".

Chapter  II  is  devoted  to  risk-indifferent  Markov  Decision 
Processes.  After  reviewing  previous  work,  namely,  Howard's 
algorithm  and  the  Linear  Programming  formulation,  we  embark  upon 
formulating  policy  constraints.  Then  the  Lagrange  multiplier 
formulation  is  outlined  and  pursued  to  its  consequences.  This 
leads  to  the  development  of  an  efficient  algorithm,  along  the 
VD-PI  lines,  whose  convergence  is  proved.  The  algorithm  is 
applied  to  Howard's  famous  taxicab  example  after  policy  constraints 
are introduced to it. All the foregoing deals with completely
ergodic Markov processes in which all states are recurrent. We
outline how the algorithm is modified when it encounters coupled
states which are transient. We also discuss periodic Markov
processes.

Chapter III is devoted to risk-sensitive Markov Decision
Processes. As in Chapter II, Howard's and Matheson's algorithm
is reviewed, a Lagrange multiplier formulation is developed, and
an algorithm emerges. Its convergence is proved, and it is
applied to the same previous example with a risk aversion
coefficient. Finally, it is pointed out that transient states
have no effect on the algorithm, and that since periodic processes
are inherently deterministic problems, they are better solved by
risk-indifferent methods.

Chapter  IV  deals  with  sensitivity  analysis.  The  concepts  of 
"constraint-indifferent"  and  "constraint-sensitive"  optimal 
policies  are  introduced,  and  a procedure  for  computing  the  worth 
of  individual  constraints  is  outlined.  It  is  explained  by  applying 
it  to  the  example  solved  in  chapter  II. 

In  chapter  V,  we  discuss  modifications  of  the  algorithm  for 
problems  having  a large  number  of  states  and  give  the  computational 
results of Howard's baseball problem. We also make some suggestions
concerning future research.

Chapter I


INTRODUCTION 


A.  Background  and  Motivation 


The context of this work is completely ergodic Markov processes with
rewards, which are allowed to run for an unlimited number of transitions.
The basic problem we are concerned with is selecting, from among a set of
such processes, one that yields the highest average return. Figure 1.1
illustrates such a Markov Decision Process. In states 1 and 3 we have
three alternatives to choose from, and in state 2 we have only two
alternatives. Associated with each alternative $k$ in state $i$ are three
probabilities of transition $p_{ij}^k$ ($j = 1, 2, 3$) from state $i$ to all
states and the rewards $r_{ij}^k$ of making said transitions under the given
alternative. In each state, $k$ can take on values of $1, 2, \ldots, K_i$,
where $K_i$ is the number of alternatives available in state $i$ (3, 2, and 3,
respectively, for Figure 1.1). Once an alternative is chosen in each
state (i.e., $k$ is given a value between 1 and $K_i$ in each state $i$), we
have what is called a stationary policy $P$. The $i$th component of $P$ is
the alternative selected in state $i$. In Figure 1.1, e.g., we gave $k$ the
values 1, 2, and 2 in states 1, 2, and 3, respectively (i.e., we selected
the first alternative in state 1 and the second alternative in states 2
and 3). In other words, we have a policy

$$P = (1, 2, 2)$$

We denote the $i$th component of $P$ by $P(i)$. Thus, for the above policy

$$P(1) = 1, \quad P(2) = 2, \quad P(3) = 2.$$

Fig. 1.1. THREE-STATE MARKOV DECISION PROCESS.


Such a policy is stationary in the sense that every time the process
is in state $i$, alternative $P(i)$ is always selected.

Once a policy is selected, a Markov process with rewards is defined.
The transition probability matrix and the reward matrix are composed of
rows determined by each alternative in each state. For the policy
mentioned in Figure 1.1, we get a transition probability matrix $P$ whose
rows are those of the selected alternatives, with a similar reward
matrix $R$.

For a completely ergodic Markov process, the limiting state
probabilities are independent of the starting state. They give the long-run
fraction of stages the process spends in each state and are given by [4,5]

$$\pi^T P = \pi^T, \qquad \sum_{i=1}^{N} \pi_i = 1 \tag{1.1}$$

where $\pi$ is the column vector whose components $\pi_i$ are the limiting
state probabilities, $P$ is the transition probability matrix of the
process, and $N$ is the number of states. Also associated with a Markov
process is the vector $q$ of immediate expected rewards defined by

$$q_i = \sum_{j=1}^{N} p_{ij} r_{ij} \tag{1.2}$$


For a risk-indifferent decision maker, Howard [4,5] has shown that
if the Markov process is allowed to run for an unlimited number of
transitions, the average reward of the process per transition, hereafter to
be called the gain of the process, is given by

$$g = \sum_{i=1}^{N} \pi_i q_i \tag{1.3}$$

where the $q_i$ are given by (1.2) and the $\pi_i$ are given by (1.1) for the
specified process (i.e., the selected policy).
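
Equations (1.1) through (1.3) can be checked numerically for any fixed policy. The following is a minimal sketch in plain Python, assuming a hypothetical two-state process; the matrices `P` and `R` and the helper names `solve`, `limiting_probabilities`, `expected_rewards`, and `gain` are illustrative and do not come from the text.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def limiting_probabilities(p):
    """Equation (1.1): pi^T P = pi^T together with sum(pi) = 1."""
    n = len(p)
    # Rows 0..n-2 encode sum_i pi_i p_ij - pi_j = 0; the last row normalises.
    a = [[p[i][j] - (1.0 if i == j else 0.0) for i in range(n)]
         for j in range(n - 1)]
    a.append([1.0] * n)
    return solve(a, [0.0] * (n - 1) + [1.0])


def expected_rewards(p, r):
    """Equation (1.2): q_i = sum_j p_ij r_ij."""
    return [sum(pij * rij for pij, rij in zip(prow, rrow))
            for prow, rrow in zip(p, r)]


def gain(p, r):
    """Equation (1.3): g = sum_i pi_i q_i."""
    pi = limiting_probabilities(p)
    q = expected_rewards(p, r)
    return sum(pi_i * q_i for pi_i, q_i in zip(pi, q))


# Hypothetical two-state process, used only for illustration.
P = [[0.5, 0.5], [0.4, 0.6]]
R = [[9.0, 3.0], [3.0, -7.0]]
print(limiting_probabilities(P), expected_rewards(P, R), gain(P, R))
```

For these illustrative numbers the limiting probabilities are 4/9 and 5/9, the immediate expected rewards are 6 and -3, and the gain is 1, up to floating-point rounding.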

For the problem illustrated in Figure 1.1, we have 18 policies to
select among. Each results in a process for which (1.1) can be solved;
then (1.2) and (1.3) are used to compute the gain. A "brute force"
method for selecting the optimal policy would be to do this for all 18
policies. However, the number of policies increases astronomically as
the number of alternatives increases. Actually, the number of policies
is $\prod_{i=1}^{N} K_i$. Thus, in the above example, if an additional
alternative were introduced in state 2, we would immediately have 9 more
policies to take into account. We are hence faced with an essentially
combinatorial problem.
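
The policy count is easy to verify by enumeration. The sketch below assumes only the alternative counts quoted for Figure 1.1 (3, 2, and 3); the function names `count_policies` and `enumerate_policies` are illustrative.

```python
from itertools import product
from math import prod


def count_policies(num_alternatives):
    """Number of stationary policies: the product of the K_i."""
    return prod(num_alternatives)


def enumerate_policies(num_alternatives):
    """All policies as tuples of 1-based alternative numbers, one per state."""
    return list(product(*(range(1, k + 1) for k in num_alternatives)))


K = [3, 2, 3]                      # K_i for the three states of Figure 1.1
policies = enumerate_policies(K)
print(count_policies(K))           # 3 * 2 * 3 policies
print(count_policies([3, 3, 3]) - count_policies(K))  # extra policies after adding one alternative in state 2
```

Enumeration of this kind is only practical for very small problems, which is the point the text makes about the brute force method.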


Howard [4,5] devised an extremely efficient iterative algorithm
which exploits certain features of this problem to solve it. First, a
value determination (VD) is made for a given policy. The VD consists
of computing "relative values" $v_i$ for each state, under the given
policy, and the gain of that policy from

$$g + v_i = q_i + \sum_{j=1}^{N} p_{ij} v_j, \qquad i = 1, 2, \ldots, N \tag{1.4}$$

Then, using the $v_i$, an attempt is made to improve the policy,
i.e., detect a policy of larger gain than the current one. To this end,
test quantities are defined in each state for each alternative. Then,
the alternative yielding the largest test quantity in each state is
selected. Formally,

$$P(i) = \left\{ k : t_i^k = \max_{k = 1, 2, \ldots, K_i} \left[ q_i^k + \sum_{j=1}^{N} p_{ij}^k v_j \right] \right\}, \qquad i = 1, 2, \ldots, N \tag{1.5}$$


If the policy improvement (PI) does not change the current policy,
it is the optimal policy. Otherwise, VD and PI, i.e., (1.4) and (1.5),
alternate until we converge to the optimal policy. The efficiency of
the VD-PI algorithm results from reducing the combinatorial problem of
simultaneously selecting different alternatives in different states to
a set of discrete maximization problems which select the alternatives
in each state independently of other states.
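
The VD-PI cycle can be sketched in a few lines of plain Python. This is an illustrative sketch and not the code used in this work. It assumes the standard form of Howard's equations: the VD solves $g + v_i = q_i + \sum_j p_{ij} v_j$ with the last relative value fixed at zero, and the PI picks, in each state, the alternative with the largest test quantity $q_i^k + \sum_j p_{ij}^k v_j$. The two-state data in `ALTERNATIVES` are hypothetical, and alternatives are numbered from 0 in the code.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def expected_reward(p_row, r_row):
    return sum(p * r for p, r in zip(p_row, r_row))


def value_determination(p, q):
    """Solve g + v_i = q_i + sum_j p_ij v_j with v_{N-1} = 0 (assumed normalisation)."""
    n = len(q)
    # Unknowns are v_0 .. v_{n-2} followed by g.
    a = [[(1.0 if i == j else 0.0) - p[i][j] for j in range(n - 1)] + [1.0]
         for i in range(n)]
    x = solve(a, q)
    return x[:-1] + [0.0], x[-1]


def policy_iteration(alternatives, policy):
    """alternatives[i] is a list of (p_row, r_row) pairs; policy holds 0-based choices."""
    policy = list(policy)
    while True:
        p = [alternatives[i][k][0] for i, k in enumerate(policy)]
        q = [expected_reward(*alternatives[i][k]) for i, k in enumerate(policy)]
        v, g = value_determination(p, q)
        new_policy = []
        for i, alts in enumerate(alternatives):
            t = [expected_reward(pr, rr) + sum(pij * vj for pij, vj in zip(pr, v))
                 for pr, rr in alts]
            best = max(t)
            # Keep the current alternative on ties so the iteration terminates.
            keep = t[policy[i]] >= best - 1e-12
            new_policy.append(policy[i] if keep else t.index(best))
        if new_policy == policy:
            return policy, g, v
        policy = new_policy


# Hypothetical two-state problem with two alternatives per state.
ALTERNATIVES = [
    [([0.5, 0.5], [9.0, 3.0]), ([0.8, 0.2], [4.0, 4.0])],
    [([0.4, 0.6], [3.0, -7.0]), ([0.7, 0.3], [1.0, -19.0])],
]
best_policy, best_gain, values = policy_iteration(ALTERNATIVES, [0, 0])
print(best_policy, best_gain, values)
```

Starting from the first alternative in both states, the sketch switches to the second alternative in both states after one improvement step and then stops, with a gain of 2 and a relative value of 10 for the first state.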

Mine and Osaki [9] formulated the risk-indifferent Markov Decision
Process as a linear program (LP). They showed that the VD is the
solution of a dual problem, whence the relative values $v_i$ are the simplex
multipliers. They also showed that the PI is a simplex iteration with
at most $N$ simultaneous pivoting operations. Using the LP formulation,
Nesbitt [10] showed how the policies can be ordered according to gain.

For the risk-sensitive decision maker possessing an exponential
utility function, Howard and Matheson [6] define a "disutility
contribution" matrix $Q$, whose elements are given by

$$q_{ij} = p_{ij} e^{-\gamma r_{ij}} \tag{1.6}$$

where the $p_{ij}$ and $r_{ij}$ are the probabilities and rewards selected by
a given policy, and $\gamma$ is the risk aversion coefficient. They derive
a certain equivalent gain given by

$$g = -\frac{1}{\gamma} \ln \lambda \tag{1.7}$$


where $\lambda$ is the "maximal eigenvalue" of $Q$ (the largest positive
eigenvalue, which exceeds the moduli of all other eigenvalues of $Q$). It
is this $g$ that has to be maximized. They devised an algorithm quite
similar to the VD-PI. It consists of a policy evaluation (PE) phase,
the counterpart of the VD. In PE, the utilities $u_i$ pertaining to a
given policy are computed by solving the eigenvalue problem

$$\lambda u_i = \sum_{j=1}^{N} q_{ij} u_j, \qquad i = 1, 2, \ldots, N \tag{1.8}$$

where the $q_{ij}$ are given by (1.6) and $\lambda$ is the maximal eigenvalue of
the corresponding matrix $Q$.
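
The policy evaluation step can be illustrated with a short sketch. It builds $Q$ from (1.6), finds the maximal eigenvalue by power iteration, and converts it to a certain equivalent gain via (1.7). Power iteration is our choice for the illustration and is not claimed to be the method of [6]; the data `P` and `R` are hypothetical, and the eigenvector returned is positive and scaled so that its largest component is 1, so it matches the utilities $u_i$ only up to scale and sign.

```python
import math


def disutility_matrix(p, r, gamma):
    """Equation (1.6): q_ij = p_ij * exp(-gamma * r_ij)."""
    return [[pij * math.exp(-gamma * rij) for pij, rij in zip(prow, rrow)]
            for prow, rrow in zip(p, r)]


def maximal_eigenpair(q, max_iter=10000, tol=1e-14):
    """Power iteration for a matrix with positive entries."""
    n = len(q)
    u = [1.0] * n
    lam = 0.0
    for _ in range(max_iter):
        w = [sum(q[i][j] * u[j] for j in range(n)) for i in range(n)]
        new_lam = max(w)              # largest component of u is kept at 1
        u = [x / new_lam for x in w]
        converged = abs(new_lam - lam) < tol
        lam = new_lam
        if converged:
            break
    return lam, u


def certain_equivalent_gain(p, r, gamma):
    """Equation (1.7): g = -(1/gamma) * ln(lambda)."""
    lam, _ = maximal_eigenpair(disutility_matrix(p, r, gamma))
    return -math.log(lam) / gamma


# Hypothetical two-state process, used only for illustration.
P = [[0.5, 0.5], [0.4, 0.6]]
R = [[9.0, 3.0], [3.0, -7.0]]
print(certain_equivalent_gain(P, R, 0.1))
```

Two sanity checks follow from the definitions. If every reward equals the same constant, the certain equivalent gain equals that constant for any $\gamma$. For a risk-averse decision maker ($\gamma > 0$) facing uncertain rewards, the certain equivalent gain lies below the risk-indifferent gain.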

Then a policy improvement (PI) phase is undertaken. It is
identical to the PI of the risk-indifferent case, except that the test
quantities are different. Here, a policy is selected such that

$$P(i) = \left\{ k : t_i^k = \max_{k = 1, 2, \ldots, K_i} \left[ \sum_{j=1}^{N} q_{ij}^k u_j \right] \right\}, \qquad i = 1, 2, \ldots, N \tag{1.9}$$

In all of the aforementioned work, no restrictions are made on the
manner in which alternatives are selected in different states. It is
assumed that the selection of an alternative in a given state has no
effect on alternative selection in any other state. In other words, there
is no interaction, or "coupling," between alternatives in different
states. This is the feature that allows considering each state
separately in the PI. However, it is an idealized situation that might or
might not hold in real life. With the profusion of rules and
regulations governing economic activities in this day and age, it might very
well turn out that alternative selection is not as "coupling-free" as
the idealized situation envisions. (Later in this work, we give a
simple example of how alternatives in different states could be coupled.)

Introducing inter-state dependence in alternative selection in
effect imposes constraints on the policies. Some policies become
"infeasible," and we have to select the optimal "feasible" policy, where
"feasibility" means satisfying the constraints. This work strives to
do exactly that, in a computationally efficient manner. Nesbitt [10]
suggests that for policy constrained problems, we start with the optimal
policy in the absence of constraints and go backwards in the ordering
checking feasibility until we hit the feasible policy yielding the
highest gain. The problem with this brute force method is that, once we get
beyond the second-to-optimum policy, we have to evaluate an increasingly
large number of policies for each step in the ordering process.
Moreover, we first have to solve the unconstrained policy problem before we
can solve the constrained policy one. Also, there has been no work on
policy ordering for the risk-sensitive case. What is really needed is
an algorithm that retains as much of the simplicity and efficiency of
Howard's VD-PI and PE-PI algorithms as possible, while taking
feasibility into account as it progresses from one feasible policy to a
better one.

Another  aspect  of  the  constrained  policy  problem  is  that  there  has 
been  no  work  done  on  formulating  the  constraints  mathematically  in  a 
systematic  manner.  This  we  also  strive  to  achieve. 

Also of interest in a constrained policy problem is the
"sensitivity" of optimal policies to the policy constraints. This
"sensitivity" can best be expressed in terms of value, i.e., how much a
rational decision maker would be willing to pay to remove a constraint.

To summarize then, this work is mainly concerned with three things:
the formulation of policy constraints; developing a VD-PI type of
algorithm for completely ergodic, infinite time horizon Markov Decision
Processes; and sensitivity analysis of optimal policies vis-a-vis the
policy constraints.


B. Methods and Results

Our point of departure is the LP formulation. There, a quantity
$d_i^k$ is introduced. That quantity represents the conditional probability
that, given the process is in state $i$, alternative $k$ is selected. In
the LP formulation, it is proved that for each state $i$, only one
$d_i^k = 1$ and the rest are zero. This is exactly what we want. We will
formulate both the Markov Decision Process and the policy constraints in
terms of the $d_i^k$.


For policy constraints, we first concentrate on the
"two-alternative-coupling" case. By this, we mean interaction between an
alternative $k$ in state $i$ and another alternative $\ell$ in some other
state $j$. The constraints will be expressed in terms of $d_i^k$ and
$d_j^\ell$, to be denoted by $a$ and $b$, respectively, for simplicity ($a$
and $b$ can only take on values of zero or unity). By exhaustion of all
possible combinations and straightforward application of simple logic, we
conclude that there can only be five different types of constraints:

$$a + b \le 1 \tag{1.10}$$
$$a + b \ge 1 \tag{1.11}$$
$$a - b \ge 0 \tag{1.12}$$
$$a + b = 1 \tag{1.13}$$
$$a - b = 0 \tag{1.14}$$

Each constraint type also carries a graphical designation, which we use on
a graph or table to indicate the type of constraint (as in Figure 1.2).
Inequality (1.10) expresses the constraint stating that, at most, one of
alternatives $a$ and $b$ is allowable in any feasible policy; (1.11)
states that at least one has to be present; (1.12) states that
alternative $b$ is not allowable in any policy unless it is accompanied by
$a$; (1.13) states that exactly one of the two alternatives must be present
in any policy; (1.14) states that we can have either both or neither
alternative in any feasible policy. In Figure 1.2, e.g., alternative
4 in state 2 cannot be selected unless alternative 3 in state 1 is.
Similarly, as regards alternative 2 in state 2 and alternative 1 in
state 3, we cannot have more than one of them in any feasible policy.

Fig. 1.2. FIVE-STATE POLICY CONSTRAINED MARKOV DECISION PROCESS.

If  more  than  two  alternatives  are  coupled  by  any  single  constraint,  we 
resort  to  the  algebra  of  events  to  express  the  constraints  as  a Boolean 
expression,  with  truth  and  falsity  being  assigned  the  values  unity  and 
zero,  respectively. 
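
The two-alternative constraint types translate directly into a feasibility test on a policy. The sketch below is illustrative only: the type names in `CONSTRAINT_TYPES` are our own labels, and the relations assumed are $a + b \le 1$, $a + b \ge 1$, $a - b \ge 0$, $a + b = 1$, and $a - b = 0$ for (1.10) through (1.14). The two couplings listed are the ones quoted for Figure 1.2 in the text; the example policies themselves are hypothetical.

```python
CONSTRAINT_TYPES = {
    "at_most_one": lambda a, b: a + b <= 1,       # (1.10)
    "at_least_one": lambda a, b: a + b >= 1,      # (1.11)
    "b_requires_a": lambda a, b: a - b >= 0,      # (1.12)
    "exactly_one": lambda a, b: a + b == 1,       # (1.13)
    "both_or_neither": lambda a, b: a - b == 0,   # (1.14)
}


def indicator(policy, state, alternative):
    """d_i^k for a stationary policy: 1 if `alternative` is chosen in `state`, else 0."""
    return 1 if policy[state] == alternative else 0


def is_feasible(policy, constraints):
    """constraints: list of (type_name, (state_i, alt_k), (state_j, alt_l))."""
    return all(
        CONSTRAINT_TYPES[name](indicator(policy, *first), indicator(policy, *second))
        for name, first, second in constraints
    )


# Couplings quoted for Figure 1.2 (states and alternatives numbered from 1).
CONSTRAINTS = [
    ("b_requires_a", (1, 3), (2, 4)),   # alt 4 in state 2 needs alt 3 in state 1
    ("at_most_one", (2, 2), (3, 1)),    # not both alt 2 in state 2 and alt 1 in state 3
]

# Hypothetical policies, given as {state: alternative}.
print(is_feasible({1: 3, 2: 4, 3: 1, 4: 1, 5: 1}, CONSTRAINTS))
print(is_feasible({1: 1, 2: 4, 3: 1, 4: 1, 5: 1}, CONSTRAINTS))
```

The first policy satisfies both couplings, while the second selects alternative 4 in state 2 without alternative 3 in state 1 and is therefore infeasible.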

If the Boolean expression is an exclusive OR, it is equated to
unity to give us our constraint. This is because, by definition, only
one component of an exclusive OR can be true. Otherwise, because of
the possibility of more than one component being true simultaneously,
the expression is set greater than or equal to unity. Of course, some
constraints involving more than one alternative can be intuitively
translated into an algebraic relationship without recourse to the algebra
of events. One such type of constraint is considered here because it is
of special significance later on. Assume that we have a policy $P$
whose first $M$ components ($M > 1$) are $a, b, c, \ldots, m$, respectively.
If we want to make $P$ and all policies that differ from $P$ in exactly
one of the first $M$ components infeasible, we can do that by the
simple constraint

$$a + b + c + \cdots + m \le M - 2 \tag{1.15}$$
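
A constraint of this kind can be checked by enumeration. The sketch below assumes the constraint takes the form $a + b + \cdots + m \le M - 2$, where $a, b, \ldots, m$ are the zero-or-unity indicators of the alternatives that $P$ selects in its first $M$ states; that form is our reading and should be treated as an assumption. The alternative counts and the reference policy are those quoted for Figure 1.1, and the function names are illustrative.

```python
from itertools import product


def agreements(policy, reference, m):
    """Number of the first m components in which policy matches reference."""
    return sum(1 for i in range(m) if policy[i] == reference[i])


def passes_exclusion(policy, reference, m):
    """Assumed constraint: the indicators of reference's choices sum to at most m - 2."""
    return agreements(policy, reference, m) <= m - 2


K = [3, 2, 3]        # alternatives per state, as quoted for Figure 1.1
P_REF = (1, 2, 2)    # the policy quoted in the text
M = 3                # here the constraint covers every state

all_policies = list(product(*(range(1, k + 1) for k in K)))
excluded = [p for p in all_policies if not passes_exclusion(p, P_REF, M)]
retained = [p for p in all_policies if passes_exclusion(p, P_REF, M)]
print(len(excluded), len(retained))
```

Under this assumption the excluded set consists of the reference policy and the five policies that differ from it in exactly one state, and every retained policy differs from it in at least two states.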


Once we develop a methodology for expressing policy constraints, we
turn to developing a VD-PI type of algorithm to handle policy constrained
problems. To this end, we exploit the fact that the original problem we
are faced with is a constrained optimization one, even in the absence of
policy constraints. For the risk-indifferent case, the objective
function to be maximized is the gain. The constraints are the equations
defining the limiting state probabilities and those requiring the $d_i^k$ to
sum to unity in each state. If we have policy constraints, they will be
additional relationships between the $d_i^k$ of different states.

The realization that we are dealing with a constrained optimization
problem leads us to the Lagrange multiplier rule, which enables us to
reduce the problem to two iteratively coupled problems defined on the
associated Lagrangian. One of them turns out to be the VD, while the other
is the maximization of the Lagrangian over the discrete set of feasible
policies. It turns out that the relevant quantity to be maximized is the
weighted sum of the test quantities. The weights are the limiting state
probabilities. In the absence of policy constraints, the individual
components of the sum can be maximized in order to maximize the sum. The
weight $\pi_i$ in each state then becomes irrelevant, and we just maximize
the test quantities, which is what the PI does. Hence, the PI actually
maximizes the Lagrangian, exploiting the absence of interstate coupling
to decompose that maximization, an essentially combinatorial problem, into
$N$ much simpler maximizations. In the presence of policy constraints,
however, such a decomposition can no longer be effected. We have to face
the combinatorial maximization problem head on. This we do by adapting
a branch and bound technique to our problem. This is a method whereby
maximization is achieved without having to enumerate the feasible set.
We prove that such a method converges to a policy that maximizes the gain
over the set of feasible policies differing from it in exactly one state.
Therefore, we add a constraint of type (1.15) and start on another set.
In this manner, we remove whole subsets from consideration without having
to consider all the elements belonging to them.
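
The contrast between the separable and the coupled maximization can be made concrete. The following is a generic branch and bound sketch, not the specific adaptation developed in this work: it maximizes a sum of per-state weights (standing in for $\pi_i$ times the test quantities) over feasible policies, using the unconstrained per-state maxima as an upper bound. The weights, the feasibility rule, and all function names are hypothetical.

```python
from itertools import product


def brute_force(weights, is_feasible):
    """Reference answer by full enumeration (only viable for tiny problems)."""
    candidates = product(*(range(len(w)) for w in weights))
    scored = [(sum(w[k] for w, k in zip(weights, pol)), pol)
              for pol in candidates if is_feasible(pol)]
    value, policy = max(scored)
    return policy, value


def branch_and_bound(weights, is_feasible):
    """Maximise sum_i weights[i][k_i] over complete policies accepted by is_feasible."""
    n = len(weights)
    # suffix_best[i]: best sum over states i..n-1 when constraints are ignored.
    suffix_best = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        suffix_best[i] = suffix_best[i + 1] + max(weights[i])
    best = {"value": float("-inf"), "policy": None}

    def visit(partial, value):
        i = len(partial)
        if value + suffix_best[i] <= best["value"]:
            return  # bound: this branch cannot beat the incumbent
        if i == n:
            if is_feasible(tuple(partial)):
                best["value"], best["policy"] = value, tuple(partial)
            return
        # Try larger weights first so that good incumbents are found early.
        for k in sorted(range(len(weights[i])), key=lambda k: -weights[i][k]):
            visit(partial + [k], value + weights[i][k])

    visit([], 0.0)
    return best["policy"], best["value"]


# Hypothetical weighted test quantities for three states (0-based alternatives).
WEIGHTS = [[3.0, 1.0], [2.0, 5.0], [4.0, 0.0]]


def coupled(policy):
    """Hypothetical coupling: alternative 0 in state 0 excludes alternative 1 in state 1."""
    return not (policy[0] == 0 and policy[1] == 1)


print(branch_and_bound(WEIGHTS, lambda pol: True))   # separable case
print(branch_and_bound(WEIGHTS, coupled))            # coupled case
```

Without the coupling, the result coincides with picking the largest weight in each state independently. With the coupling, the state-by-state choice is infeasible and the search has to trade one state off against another.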

To retain as much of the VD-PI as possible, we introduce the notions
of "free" and "coupled" states. The former are states in which no
alternatives are involved in any policy constraints. The "coupled" states are
those which are not "free." As long as the original PI yields feasible
policies, we do not use branch and bound. Failing that, we maximize over
the free states by original PI and over the coupled states by branch and
bound. This algorithm has two advantages. First, it achieves
computational efficiency by sticking to Howard's PI as much as possible.
Secondly, if the policy yielding the highest gain in the absence of policy
constraints is not made infeasible by the introduction of these
constraints, the algorithm detects it without having to exhaust the feasible
policy set (by removing successive subsets). This is of particular
significance for sensitivity analysis.
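
The split into free and coupled states follows mechanically from the list of constraints. In the sketch below, each constraint is represented simply by the (state, alternative) pairs it involves; this representation and the function names are our own. The two couplings are the ones quoted for Figure 1.2 in the text, and the figure may contain others that are not reproduced here.

```python
def coupled_states(constraints):
    """States having at least one alternative involved in some constraint."""
    return {state for constraint in constraints for state, _alt in constraint}


def split_states(num_states, constraints):
    """Return (free, coupled) state lists, with states numbered from 1."""
    coupled = coupled_states(constraints)
    free = [s for s in range(1, num_states + 1) if s not in coupled]
    return free, sorted(coupled)


# Each constraint lists the (state, alternative) pairs it couples.
CONSTRAINTS = [
    ((1, 3), (2, 4)),   # alt 3 in state 1 with alt 4 in state 2
    ((2, 2), (3, 1)),   # alt 2 in state 2 with alt 1 in state 3
]

free, coupled = split_states(5, CONSTRAINTS)
print(free, coupled)
```

With only these two couplings in a five-state process, states 1, 2, and 3 are coupled and states 4 and 5 are free, so ordinary policy improvement would still apply unchanged in the last two states.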

A Lagrange multiplier formulation is also applied to the
risk-sensitive case. Here, the objective function is the maximal eigenvalue of the
$Q$ matrices associated with the feasible policies. The constraints are
the eigenvalue problem defining the maximal eigenvalue, plus the policy
constraints. As in the risk-indifferent case, Howard's and Matheson's
PE-PI algorithm is shown to be the transformation of the original
problem, via Lagrange multipliers, to two problems. The breakdown of the PI
when policy constraints are introduced is shown, and a similar algorithm
is developed.

Finally, sensitivity analysis is considered. As mentioned before,
our algorithm is capable of detecting whether or not the policy
constraints have any effect on the optimal policy in their absence, without
having to solve the unconstrained policy problem. The value of removing
individual policy constraints is explored for those situations where
they affect optimal policy selection.

C.  Outline 

Chapter II is devoted to risk-indifferent Markov Decision Processes.
After reviewing previous work, namely Howard's VD-PI algorithm and the
LP formulation, we embark upon formulating policy constraints. Then the
Lagrange multiplier formulation is outlined and pursued to its
consequences. This leads to the development of an efficient algorithm, along
the VD-PI lines, whose convergence is proved. The algorithm is applied
to Howard's famous taxicab example after policy constraints are
introduced to it. All the foregoing deals with completely ergodic Markov
processes, in which all states are recurrent. We outline how the algorithm
is modified when it encounters coupled states which are transient. We
also discuss periodic Markov processes.

Chapter III is devoted to risk-sensitive Markov Decision Processes.
As in Chapter II, Howard's and Matheson's PE-PI algorithm is reviewed, a
Lagrange multiplier formulation is developed, and an algorithm emerges.
Its convergence is proved, and it is applied to the same previous
example with a risk aversion coefficient.

Chapter IV deals with sensitivity analysis. The concepts of
"constraint-indifferent" and "constraint-sensitive" optimal policies are
introduced, and a procedure for computing the worth of individual
constraints is outlined. It is explained by applying it to the example
solved in Chapter II.

In Chapter V, we discuss modifications of the algorithm for
problems having a large number of states and give the computational results
for Howard's baseball problem. We also make some suggestions concerning
future research.


Chapter  II 


RISK-INDIFFERENT MARKOV DECISION PROCESSES

A. Introduction

In this chapter, we deal with risk-indifferent Markov decision
processes, where we progress from unconstrained policies to constrained ones.

Section B deals exclusively with unconstrained policies. First,
Howard's value determination-policy iteration algorithm (hereafter referred
to as VD-PI) is developed. Then, the linear programming formulation of
the problem is developed. Most of this section appears in the literature
but is included here because it forms the foundation on which the results
of this work are based. For example, the linear programming formulation
provides us with the mathematical encoding of the process of selecting
one alternative in each state. The conditional probability $d_i^k$ of
selecting alternative $k$, given the system is in state $i$, together with
the important result that all $d_i^k$'s are zero or unity, enables us to
express policy constraints.

Section C deals with constrained policies. The definition of what
we mean by constraints on the policies is spelled out. We mean
interaction, or "coupling," between alternatives in different states, such as
the selection of one alternative in a certain state preventing the
selection of another alternative in some other state. First, we deal with
"couplings" between two alternatives only, and we show that all such
couplings reduce to five types of constraints. The general case is treated
by the algebra of events. We give an example of a 3-alternative coupling
and show how one of the 2-alternative couplings can be derived from the
general case. Then, we show that a constrained policy problem can be
reduced to a number of LP's, which is unacceptable on account of that
number being, more often than not, astronomical.

The approach we take to solve the problem is the realization that,
even in the absence of policy constraints, we are faced, basically, with
a constrained maximization problem. The objective function is the gain,
and the constraints are the equations defining the limiting state
probabilities. We consequently use a Lagrange multiplier (LM) formulation of
the problem to reduce it to two unconstrained, iteratively coupled,
problems. One of them turns out to be the VD. The other one is the
maximization of the Lagrangian $L$ over the discrete set of feasible policies.
This is an essentially combinatorial problem. We show that, in the
absence of policy constraints, the lack of "coupling" facilitates the
reduction of that problem to a number of simple discrete maximization
problems, yielding Howard's PI. The presence of coupling, however, in the
case of constrained policies destroys the reduction feature. Thus, we
seek an efficient means for solving the combinatorial problem of
maximizing $L$ over the discrete set of feasible policies.

In Section C, we also point out the fact that maximizing $L$ per se
in PI does not guarantee selection of a policy having a higher gain,
i.e., policy improvement. Rather, the fact that $L$ and $g$ (the gain
we are trying to maximize) have the same value at the optimum and after
each VD justifies trying to increase $L$. Improving the policy has to be
guaranteed outside the Lagrangian framework. This we do by introducing
a sufficient condition for improving the policy. This condition, which
was derived by Howard [4,5], is satisfied by the maximization of $L$ when
no policy constraints are present. Actually, it is also sufficient to
guarantee the VD-PI convergence to an optimum policy.

In Section D, we develop an algorithm for solving the problem, when
faced with policy constraints, on the basis of the LM formulation of
Section C. The algorithm is composed of the usual VD, plus a new PI which
maximizes the gain over subsets of the set of feasible policies. It is
based on the branch and bound (BB) technique for solving combinatorial
problems [2,3,7]. One such method is adapted to our problem, and we show
that it converges to a policy that maximizes the gain over the set of all
feasible policies differing from it in exactly one state. This set is
immediately removed from further consideration by a simple constraint, thus
reducing the set of feasible policies. To increase computational
efficiency, we introduce the notions of "free" and "coupled" states. A "free"
state is one in which no alternative is coupled with any other
alternative in any state, i.e., not involved in any policy constraints. A
"coupled" state is one that has at least one alternative in it "coupled"
with some alternative in another state, i.e., involved in some policy
constraint(s). Our PI is invoked only if regular PI yields an infeasible
policy. In this case, the free states are maximized by regular PI and
the coupled states by branch and bound. In either case, the sufficient
condition of Section C is satisfied, and we have an improved policy. This
has a further advantage. If the policy constraints do not make the
optimum policy (without constraints) infeasible, then it can be detected once
it is encountered, and we do not have to exhaust the feasible policy set
to reach the optimum.

The  convergence  of  the  developed  algorithm  is  proved  in  Section  D. 

In Section E, we apply the algorithm to Howard's famous taxicab
example after some policy constraints are imposed on it.

Sections B through E deal with policies that do not result in
transient states; all states are recurrent. In Section F, we address the
problem of transient coupled states. We also consider periodic Markov
processes, and from them we infer that the manner in which we handle
transient coupled states and how we obtain an initial feasible policy
are both cases where we profess complete ignorance. In the former case,
the zero value of $\pi_i$ for a transient state obliterates our accumulated
knowledge about that state, as far as the Lagrangian is concerned. In
the latter case, the lack of an initial feasible policy is equivalent
to complete ignorance of the Markov process we are dealing with.

B. Markov Decision Processes without Policy Constraints

The objective here is to select a stationary policy that maximizes
the average return per transition of the completely ergodic system, where
all states are recurrent, if it is allowed to make many transitions,
i.e., over an infinite time horizon.

This is achieved by the value determination-policy improvement
algorithm, which computes values for a given policy, then obtains a better
policy, until the optimum policy is obtained.

1. Value Determination-Policy Improvement Formulation
a. Value Determination

We start out with a finite time horizon, i.e., allow the
system to make only n transitions, then extend the horizon. We denote
the expected total earnings in the next n transitions if the system is
in state i by $v_i(n)$. To compute this quantity, we note that, if a
transition is made to state j, its value will be the $r_{ij}$ earned by the
transition plus the amount earned by starting in state j with one
transition fewer remaining, i.e., $v_j(n-1)$. This amount must
be weighted by the probability of making the transition from i to j,
i.e., $p_{ij}$. Since the transitions from i are mutually exclusive,
$v_i(n)$ is simply the sum of the weighted quantities. In other words,

$$v_i(n) = \sum_{j=1}^{N} p_{ij}\,\bigl[r_{ij} + v_j(n-1)\bigr], \qquad i = 1, \ldots, N; \quad n = 1, 2, 3, \ldots \tag{2.1}$$


If we define the immediate expected reward for a transition
from state i by

$$q_i = \sum_{j=1}^{N} p_{ij}\, r_{ij}, \qquad i = 1, \ldots, N, \tag{2.2}$$


we can write Equations (2.1) as

$$v_i(n) = q_i + \sum_{j=1}^{N} p_{ij}\, v_j(n-1), \qquad i = 1, \ldots, N; \quad n = 1, 2, 3, \ldots \tag{2.3}$$
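The recursion (2.3) is easy to iterate numerically. The following Python sketch does so for a hypothetical two-state process; the transition probabilities, the rewards, and the function names are illustrative assumptions and do not come from the text.

```python
# Hypothetical two-state process (illustrative numbers, not from the text).
P = [[0.5, 0.5],
     [0.4, 0.6]]          # P[i][j] = p_ij
R = [[9.0, 3.0],
     [3.0, -7.0]]         # R[i][j] = r_ij

def immediate_rewards(P, R):
    """q_i = sum_j p_ij * r_ij, Equation (2.2)."""
    return [sum(p * r for p, r in zip(p_row, r_row))
            for p_row, r_row in zip(P, R)]

def iterate_values(P, q, n_steps):
    """Apply v_i(n) = q_i + sum_j p_ij * v_j(n-1), Equation (2.3),
    starting from v_i(0) = 0. Returns [v(0), v(1), ..., v(n_steps)]."""
    v = [0.0] * len(P)
    history = [v]
    for _ in range(n_steps):
        v = [q_i + sum(p * v_j for p, v_j in zip(p_row, v))
             for q_i, p_row in zip(q, P)]
        history.append(v)
    return history

q = immediate_rewards(P, R)
history = iterate_values(P, q, 50)
last, previous = history[-1], history[-2]
increments = [a - b for a, b in zip(last, previous)]
print("q =", q)
print("v(50) - v(49) =", increments)
print("v_1(50) - v_2(50) =", last[0] - last[1])
```

For this assumed data, the per-step increments of both states settle to a common constant and the difference between the two totals stops changing as n grows.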


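It can be shown [4,5] that, for a completely ergodic process, the
asymptotic behavior of (2.3) is given by

$$v_i(n) = ng + v_i, \qquad i = 1, 2, \ldots, N, \tag{2.4}$$

where $v_i$ is the "relative value" of being in state i, and

$$g = \sum_{i=1}^{N} \pi_i\, q_i, \tag{2.5}$$

with $\pi_i$ the steady-state probability of being in state i.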


Hence, g is the average return per transition of the system if it is
allowed to make many transitions under a given policy. Such a policy
is stationary in the sense that it does not depend on n, i.e., if we
find ourselves in a given state, we select a particular alternative,
irrespective of n. We are seeking a policy which maximizes this gain
g. Once a policy is determined, the $\pi$'s and q's are available, and
hence (2.5) gives us the gain of that policy. However, we have no means
of finding a better policy, if one exists. The key to this lies in
(2.4). For large n,

$$v_i(n) = ng + v_i, \qquad i = 1, 2, \ldots, N.$$

We also have that (2.3) holds for all n:

$$v_i(n) = q_i + \sum_{j=1}^{N} p_{ij}\, v_j(n-1), \qquad i = 1, 2, \ldots, N.$$


Thus, for an infinite time horizon, we can substitute
(2.4) into (2.3) to get

$$ng + v_i = q_i + \sum_{j=1}^{N} p_{ij}\,\bigl[(n-1)g + v_j\bigr], \qquad i = 1, 2, \ldots, N, \tag{2.6}$$

which, by virtue of $\sum_{j=1}^{N} p_{ij} = 1$, is reduced to

$$g + v_i = q_i + \sum_{j=1}^{N} p_{ij}\, v_j, \qquad i = 1, 2, \ldots, N. \tag{2.7}$$

Here, we have a set of N simultaneous linear equations
in the unknowns $v_1, \ldots, v_N$ and g, a total of N + 1 unknowns. We notice
that adding a constant c to each $v_i$ in (2.7) gives

$$g + v_i + c = q_i + \sum_{j=1}^{N} p_{ij}\,(v_j + c), \tag{2.8}$$

i.e.,

$$g + v_i = q_i + \sum_{j=1}^{N} p_{ij}\, v_j. \tag{2.9}$$


But (2.9) are the original equations (2.7). Hence, the true values of
$v_i$ have no real significance in processes with infinite horizons. It
is the differences between the $v_i$'s that matter. This is shown by

$$v_i(n) = ng + v_i \tag{2.10}$$

$$v_j(n) = ng + v_j \tag{2.11}$$

whence

$$v_i(n) - v_j(n) = v_i - v_j. \tag{2.12}$$


Thus, setting any one of the $v_i$'s equal to zero,
usually $v_N$, and solving (2.7) gives us the gain of the given policy and
a set of v's we call the relative values of the policy. Those are
used to select a policy having a higher gain than the given one.
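The value determination step amounts to solving the linear system (2.7) with $v_N = 0$. A minimal Python sketch follows; it uses exact rational arithmetic and a small hand-written Gauss-Jordan solver so that it depends only on the standard library. The two-state data and all function names are assumptions made for illustration.

```python
from fractions import Fraction

def solve_linear_system(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination.
    Assumes A is square and nonsingular."""
    n = len(A)
    M = [[Fraction(a) for a in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot_row = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot_row] = M[pivot_row], M[col]
        pivot = M[col][col]
        M[col] = [entry / pivot for entry in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [row[n] for row in M]

def value_determination(P, q):
    """Solve g + v_i = q_i + sum_j p_ij v_j (Equation (2.7)) with v_N = 0.
    Returns (g, [v_1, ..., v_N])."""
    n = len(P)
    # Unknown vector: (v_1, ..., v_{N-1}, g). Row i encodes
    # sum_{j<N} (delta_ij - p_ij) v_j + g = q_i.
    A = [[(1 if i == j else 0) - P[i][j] for j in range(n - 1)] + [1]
         for i in range(n)]
    solution = solve_linear_system(A, q)
    return solution[-1], solution[:-1] + [Fraction(0)]

# Hypothetical two-state policy (illustrative numbers only).
P = [[Fraction(1, 2), Fraction(1, 2)],
     [Fraction(2, 5), Fraction(3, 5)]]
q = [Fraction(6), Fraction(-3)]
gain, values = value_determination(P, q)
print("g =", gain, " v =", values)
```

Fixing $v_N = 0$ removes the one degree of freedom noted in (2.8) and (2.9), so the remaining N unknowns are determined by the N equations.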


b. Policy Improvement

Here, we also start with a finite horizon, then extend
it by applying (2.4). If we define $v_i(n)$ as the total expected return
in n stages, if we start in state i and an optimal policy is
followed, then applying the principle of optimality of dynamic programming,
we have for any n

$$v_i(n+1) = \max_k \sum_{j=1}^{N} p_{ij}^k\,\bigl[r_{ij}^k + v_j(n)\bigr], \qquad n = 0, 1, 2, \ldots \tag{2.13}$$


This may be written as

$$v_i(n+1) = \max_k \Bigl[\, q_i^k + \sum_{j=1}^{N} p_{ij}^k\, v_j(n) \Bigr], \qquad n = 0, 1, 2, \ldots \tag{2.14}$$


Thus, if we have an optimal policy up to stage n, we can
find the best alternative in state i at stage n+1 by maximizing

$$q_i^k + \sum_{j=1}^{N} p_{ij}^k\, v_j(n)$$

over all alternatives k in state i. For an infinite horizon, we
substitute for $v_j(n)$ from (2.4) to obtain

$$q_i^k + \sum_{j=1}^{N} p_{ij}^k\,(ng + v_j)$$

as the test quantity to be maximized in each state.

The fact that $\sum_{j=1}^{N} p_{ij}^k = 1$, irrespective of k, reduces
this to

$$q_i^k + \sum_{j=1}^{N} p_{ij}^k\, v_j + ng.$$

Since ng does not depend on the policy that is selected,
it is sufficient to maximize

$$q_i^k + \sum_{j=1}^{N} p_{ij}^k\, v_j \tag{2.15}$$


over all alternatives k in state i. Thus, for each state we select
an alternative k, and this results in a new policy P. Thus, given a
policy A, we solve

$$v_i^A + g^A = q_i^A + \sum_{j=1}^{N} p_{ij}^A\, v_j^A$$

by setting $v_N^A = 0$ to obtain $v_i^A$, $i = 1, 2, \ldots, N-1$, and the gain $g^A$
of policy A. Then, using the v's of policy A, we select an
alternative k in each state i to maximize

$$q_i^k + \sum_{j=1}^{N} p_{ij}^k\, v_j^A.$$

The alternatives k make up the new policy B, say. If
it is identical to A, it is the optimum policy. Otherwise, a new
iteration is started.
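The complete value determination-policy improvement cycle can be sketched in a few lines of Python. This is an illustrative sketch under assumed data: the two-state, two-alternative numbers, the nested-list layout (`P_alt[i][k]` is the row $p_{ij}^k$, `q_alt[i][k]` is $q_i^k$), and the function names are all hypothetical. Ties in the test quantity are resolved in favor of the current alternative, which is an assumption added here so the loop cannot oscillate.

```python
from fractions import Fraction as F

def solve(A, b):
    """Exact Gauss-Jordan solve of A x = b (A square, nonsingular)."""
    n = len(A)
    M = [[F(a) for a in row] + [F(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot_row = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot_row] = M[pivot_row], M[col]
        pivot = M[col][col]
        M[col] = [e / pivot for e in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [row[n] for row in M]

def value_determination(P, q):
    """Gain and relative values of one policy, from (2.7) with v_N = 0."""
    n = len(P)
    A = [[(1 if i == j else 0) - P[i][j] for j in range(n - 1)] + [1]
         for i in range(n)]
    x = solve(A, q)
    return x[-1], x[:-1] + [F(0)]

def policy_iteration(P_alt, q_alt, policy):
    """Alternate value determination and policy improvement until the
    policy repeats. policy[i] is the index of the alternative in state i."""
    policy = list(policy)
    while True:
        P = [P_alt[i][k] for i, k in enumerate(policy)]
        q = [q_alt[i][k] for i, k in enumerate(policy)]
        gain, v = value_determination(P, q)
        improved = []
        for i, current in enumerate(policy):
            # Test quantity (2.15) for every alternative k in state i.
            scores = [q_alt[i][k] + sum(p * vj for p, vj in zip(P_alt[i][k], v))
                      for k in range(len(q_alt[i]))]
            best = max(scores)
            improved.append(current if scores[current] == best
                            else scores.index(best))
        if improved == policy:
            return policy, gain, v
        policy = improved

# Hypothetical data: two states, two alternatives each.
P_alt = [[[F(1, 2), F(1, 2)], [F(4, 5), F(1, 5)]],
         [[F(2, 5), F(3, 5)], [F(7, 10), F(3, 10)]]]
q_alt = [[F(6), F(4)], [F(-3), F(-5)]]
policy, gain, v = policy_iteration(P_alt, q_alt, [0, 0])
print("policy =", policy, " g =", gain, " v =", v)
```

Starting from the first alternative in each state, the loop for this data stops after one policy change, when the improvement step reproduces the policy it was given.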


2. Linear Programming Formulation

The Markov decision process can also be formulated as a linear
programming (LP) problem. To do this, we first recall that the function
to be maximized is

$$g = \sum_{i=1}^{N} \pi_i(P)\, q_i(P), \tag{2.16}$$

where (2.16) is merely (2.5) rewritten so as to emphasize the dependence
of the $\pi$'s and q's on the policy and where the maximization takes
place over all possible policies.

The next thing we do is to introduce a set of new variables
$d_i^k$. Each $d_i^k$ is the conditional probability of selecting alternative
k, given that the system is in state i. (Those variables, hence, have
to have a value of zero or unity. This, however, will be proved to
result from the basic properties of linear programming, rather than being
imposed as a constraint.) Hence, in any state i, the expected immediate
reward $q_i(P)$ is the sum of the $q_i^k$ that result from selecting the
various alternatives k in state i, weighted by the probabilities of
selecting those alternatives, i.e.,

$$q_i(P) = \sum_{k=1}^{K_i} q_i^k\, d_i^k, \tag{2.17}$$


whence our objective function becomes

$$\sum_{i=1}^{N} \sum_{k=1}^{K_i} \pi_i(P)\, q_i^k\, d_i^k. \tag{2.18}$$

Here, the $\pi$'s and d's are variables, whence our function
is no longer linear. However, using the definition of conditional
probability, and denoting the joint probability of being in state i and
selecting alternative k by $x_i^k$, and recalling that $\pi_i$ is the
steady-state probability of being in state i, we get

$$d_i^k = \frac{x_i^k}{\pi_i(P)},$$

which gives

$$x_i^k = \pi_i(P)\, d_i^k. \tag{2.19}$$


The constraints on the $d_i^k$ follow from their being
probabilities:

$$\sum_{k=1}^{K_i} d_i^k = 1, \qquad d_i^k \ge 0, \qquad i = 1, 2, \ldots, N. \tag{2.20}$$


Now, the original constraints on the $\pi$'s were

$$\sum_{i=1}^{N} \pi_i(P)\, p_{ij}(P) - \pi_j(P) = 0, \qquad j = 1, 2, \ldots, N, \tag{2.21}$$

$$\sum_{j=1}^{N} \pi_j(P) = 1. \tag{2.22}$$


Using (2.19), it can be shown [9] that our linear programming
problem is

$$\max \sum_{i=1}^{N} \sum_{k=1}^{K_i} q_i^k\, x_i^k \tag{2.23}$$

subject to

$$\sum_{i=1}^{N} \sum_{k=1}^{K_i} p_{ij}^k\, x_i^k - \sum_{k=1}^{K_j} x_j^k = 0, \qquad j = 1, 2, \ldots, N, \tag{2.24}$$

$$\sum_{j=1}^{N} \sum_{k=1}^{K_j} x_j^k = 1, \qquad x_j^k \ge 0. \tag{2.25}$$
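The link between a stationary policy and a feasible point of this LP can be checked directly: for a fixed policy, the variables $x_i^k = \pi_i d_i^k$ of (2.19) should satisfy (2.24) and (2.25), and the objective (2.23) should equal the gain. The Python sketch below does this for hypothetical two-state data; the numbers, the data layout, and the helper names are assumptions, and no LP solver is involved.

```python
from fractions import Fraction as F

def solve(A, b):
    """Exact Gauss-Jordan solve of A x = b (A square, nonsingular)."""
    n = len(A)
    M = [[F(a) for a in row] + [F(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot_row = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot_row] = M[pivot_row], M[col]
        pivot = M[col][col]
        M[col] = [e / pivot for e in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [row[n] for row in M]

def stationary_distribution(P):
    """Solve sum_i pi_i p_ij = pi_j for j = 1..N-1 plus sum_i pi_i = 1,
    i.e. (2.21) with its redundant equation replaced by (2.22)."""
    n = len(P)
    A = [[P[i][j] - (1 if i == j else 0) for i in range(n)]
         for j in range(n - 1)]
    A.append([1] * n)
    return solve(A, [0] * (n - 1) + [1])

# Hypothetical data: P_alt[i][k] is the row p_ij^k, q_alt[i][k] is q_i^k.
P_alt = [[[F(1, 2), F(1, 2)], [F(4, 5), F(1, 5)]],
         [[F(2, 5), F(3, 5)], [F(7, 10), F(3, 10)]]]
q_alt = [[F(6), F(4)], [F(-3), F(-5)]]
policy = [1, 1]                      # chosen alternative in each state
n = len(policy)

pi = stationary_distribution([P_alt[i][policy[i]] for i in range(n)])
# x_i^k = pi_i * d_i^k with d_i^k = 1 for the chosen alternative, else 0.
x = [[pi[i] if k == policy[i] else F(0) for k in range(len(q_alt[i]))]
     for i in range(n)]

balance = [sum(P_alt[i][k][j] * x[i][k]
               for i in range(n) for k in range(len(x[i]))) - sum(x[j])
           for j in range(n)]                      # left side of (2.24)
total = sum(sum(row) for row in x)                 # left side of (2.25)
objective = sum(q_alt[i][k] * x[i][k]
                for i in range(n) for k in range(len(x[i])))  # (2.23)
print("pi =", pi, " balance =", balance, " total =", total, " g =", objective)
```

The vector `x` built this way has exactly one positive entry per state, which is the structure the theorem below establishes for every basic feasible solution.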


Now we proceed to prove that this LP yields values for $d_i^k$
which are either zero or unity, whence the $d_i^k$ become the mathematical
encoding of selecting one alternative in each state.


Theorem 2.1.

Any basic feasible solution to the LP defined by (2.23) through
(2.25) has the property that, for each i, there is only one k such
that $x_i^k > 0$, and $x_i^k = 0$ otherwise.
Proof.

For the completely ergodic process, rank $(I - P) = N - 1$. Thus, one
of the constraints (2.24) is redundant, and the rank of the constraints
is N. From the basic properties of linear programming, it follows that
any basic feasible solution has N positive variables $x_i^k$, with the rest
of the variables zero. Now, let us look at the equations of the
constraints in detail (Table 2.1). Because $-p_{ij}^k$ $(i \neq j)$ is negative and
$(1 - p_{jj}^k)$ is positive, it follows that, in each of the first N
equations, there has to be at least one $x_j^k$ associated with a term $(1 - p_{jj}^k)$
which is not zero; e.g., in the first equation, if $x_1^k = 0$ for $k = 1, 2,
\ldots, K_1$, then the $x_i^k$ which are not zero (i.e., positive) are all
multiplied by negative coefficients and hence sum up to a negative number,
contradicting the value of the R.H.S. Also, the fact that the first N
equations contain a redundant one does not change the fact that it has
to be satisfied. Thus, for each i, there has to be at least one $x_i^k > 0$
for some k.

If, for some state i, more than one $x_i^k > 0$, then there
remain fewer than (N-1) nonzero $x_i^k$ for the remaining (N-1) states.
This would mean that, for some i, all $x_i^k = 0$, contradicting the fact
that at least one such $x_i^k > 0$. Thus, for each i, there can be at most
one k such that $x_i^k > 0$. "At most one" and "at least one" mean "only
one."
The following corollary to this theorem provides us with the
result we sought to prove.

Corollary.

Any basic feasible solution to the LP defined by (2.23) through
(2.25) yields a pure stationary strategy, i.e., for each i, $d_i^k = 1$ for
some k and zero for all other k.

Proof.

Equations (2.19) give

$$\sum_{k=1}^{K_i} x_i^k = \pi_i \sum_{k=1}^{K_i} d_i^k. \tag{2.26}$$

Substituting (2.20) in (2.26),

$$\sum_{k=1}^{K_i} x_i^k = \pi_i. \tag{2.27}$$

Substituting (2.27) in (2.19),

$$x_i^k = d_i^k \sum_{k'=1}^{K_i} x_i^{k'}, \tag{2.28}$$

i.e.,

$$d_i^k = \frac{x_i^k}{\sum_{k'=1}^{K_i} x_i^{k'}}, \qquad i = 1, 2, \ldots, N. \tag{2.29}$$

The theorem states that for any given i,

$$x_i^k = 0 \ \ (k \neq \ell), \qquad x_i^\ell > 0, \qquad 1 \le \ell \le K_i. \tag{2.30}$$

Hence,

$$\sum_{k=1}^{K_i} x_i^k = x_i^\ell. \tag{2.31}$$

Thus,

$$d_i^k = x_i^k / x_i^\ell. \tag{2.32}$$

Whence, for $k \neq \ell$, $d_i^k = 0$, and

$$d_i^\ell = x_i^\ell / x_i^\ell = 1, \tag{2.33}$$

i.e., only one alternative is chosen in each state. This important
result will be used when extending the Markov decision process to the case
where the policies are constrained.


C. Markov Decision Processes with Policy Constraints


Formulation of Policy Constraints


Our point of departure here is the LP formulation for the
unconstrained policies case. Rewriting (2.23) through (2.25) after
substituting from (2.19) and (2.22), we get our original nonlinear problem:

$$\max \sum_{i=1}^{N} \pi_i \sum_{k=1}^{K_i} q_i^k\, d_i^k \tag{2.34}$$

subject to

$$\pi_j - \sum_{i=1}^{N} \pi_i \sum_{k=1}^{K_i} p_{ij}^k\, d_i^k = 0, \qquad j = 1, 2, \ldots, N, \tag{2.35}$$

$$\sum_{i=1}^{N} \pi_i \sum_{k=1}^{K_i} d_i^k = 1, \tag{2.36}$$

$$\sum_{k=1}^{K_i} d_i^k = 1, \qquad d_i^k \ge 0, \qquad i = 1, 2, \ldots, N. \tag{2.37}$$


Note that (2.34) through (2.37) define the same problem as
(2.23) through (2.25). Hence, whatever applies to (2.23) through (2.25)
applies to (2.34) through (2.37). Specifically, we know beforehand that
in the solution of (2.34) to (2.37) the $d_i^k$ are either zero or unity,
with $d_i^k = 1$ for only one k in each state i. The significance of
this will become apparent later.

Now, we introduce constraints on policies. By constraints, we
mean interaction, or coupling, between alternatives in different states.
For example, it might happen that selecting alternative j, when the
system is in state i, prevents the selection of alternative $\ell$ in state
k. Thus, any policy having P(i) = j and P(k) = $\ell$ is nonfeasible.
Assuming that the mathematical encoding of alternative selection is valid,
i.e., $d_i^k$ is zero or unity and is unity for only one k in each i,
the above constraint may formally be expressed as

$$d_i^j + d_k^\ell \le 1. \tag{2.38}$$

(2.38) plus $d_i^k = 0$ or 1 imply that no more than one of $d_i^j$ and $d_k^\ell$
can be unity. Of course, they can both be zero.
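A small Python sketch shows how a constraint of the form (2.38) screens policies. The three-state layout, the particular coupled alternatives, and the function names are hypothetical choices for illustration; indices are 0-based in the code.

```python
from itertools import product

def d(policy, state, alternative):
    """Indicator d_state^alternative: 1 if the policy selects that
    alternative in that state, 0 otherwise."""
    return 1 if policy[state] == alternative else 0

def satisfies_exclusion(policy, i, j, k, l):
    """Constraint (2.38): d_i^j + d_k^l <= 1."""
    return d(policy, i, j) + d(policy, k, l) <= 1

# Hypothetical process: 2, 2 and 3 alternatives in states 0, 1 and 2.
alternatives_per_state = [2, 2, 3]
all_policies = list(product(*(range(K) for K in alternatives_per_state)))

# Assumed coupling: alternative 1 in state 0 excludes alternative 0 in state 2.
feasible = [p for p in all_policies if satisfies_exclusion(p, 0, 1, 2, 0)]

print(len(all_policies), "policies,", len(feasible), "feasible")
print("infeasible:", [p for p in all_policies if p not in feasible])
```

Only the policies that select both coupled alternatives at once are removed; every other policy, including those that select neither, remains feasible.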

Now we consider the encoding of policy constraints in general.
First, we handle constraints that only couple two alternatives in
different states, i.e., $d_i^j$ and $d_k^\ell$, for example. We shall hereafter refer
to such constraints as binary constraints. We will show that no matter
how the constraint is stated, it reduces to one of five relations.


Theorem 2.2.

Any policy constraint consisting of an interaction between
alternative j in state i and alternative $\ell$ in state k can be expressed
as one of the following:

$$a + b \ge 1 \tag{2.39}$$

$$a - b \ge 0 \tag{2.40}$$

$$a + b \le 1 \tag{2.41}$$

$$a + b = 1 \tag{2.42}$$

$$a - b = 0 \tag{2.43}$$

where $a = d_i^j$ and $b = d_k^\ell$, and both are either zero or unity.

Proof.

We go about proving the above by simply exhausting all
possibilities. Since a and b can each have only one of two values, the pair
(a,b) cannot have more than four values, and any constraint merely
limits the number of values that pair can have. Thus, we translate the
constraint as outlawing certain of those values. First, we deal with
the trivial cases.

Allowing all values (i.e., outlawing none) is equivalent to saying
that we have no constraints, while outlawing all four values is a
contradiction. The pair (a,b) is assured to exist and belong to the set
{(0,0), (0,1), (1,0), (1,1)}. Outlawing three values and only allowing one
is a case where we do not need any constraints. This is because we are
saying that a value has been assigned to both a and b. If the value
of a is zero, say, it means that alternative j in state i is not
allowed. Thus, we just discard it. (Actually, this is a contradiction
on the part of the decision maker. On the one hand he is saying that
there is a number of alternatives available in state i, and on the
other hand he is saying that one of those alternatives does not exist.)
Likewise, if the value of a is unity, this means alternative j will
always be selected in state i, whence we should discard all other
alternatives in that state. (Yet another contradiction; hereafter,
whenever the value of a or b is predetermined by a constraint, we will
consider that to be a contradiction and point it out.) What applies to
a applies to b in the foregoing. Hence, we are left with two cases,
namely those where only one or two pair values are outlawed.

Inequality (2.39) outlaws (0,0) and allows the three other possible
values. This constraint can be stated as follows: any policy has to
have either a or b, or both. Inequality (2.40) outlaws the pair (0,1), which in
plain English says that, if alternative a is not selected, then neither
can b. (If the constraint is the other way around, i.e., not selecting
b prevents selection of a, merely renaming a and b makes (2.40)
applicable.) Inequality (2.41) outlaws (1,1), which is the type of
constraint we already discussed in (2.38). This exhausts the case where only
one value of the pair (a,b) is outlawed. Equation (2.42) outlaws (0,0)
and (1,1); in effect, it says that at least one of a and b must be
selected, but the selection of one prevents selecting the other.
Equation (2.43) outlaws (1,0) and (0,1); it is the type of constraint which says
that selecting (not selecting) one alternative necessitates selecting
(not selecting) the other. There remain, however, four combinations of
values that have not been outlawed by any of (2.39) through (2.43). We
show that they represent contradictions. Outlawing (0,0) and (0,1) means
that the only feasible values are (1,0) or (1,1). But, here, a = 1,
which we previously showed represents a contradiction on the part of the
decision maker. Likewise, outlawing (1,0) and (1,1) leaves us with (0,0)
or (0,1), implying a = 0. In the same manner, outlawing (0,0) and (1,0)
implies b = 1, while outlawing (0,1) and (1,1) implies b = 0.
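The case analysis of the proof can be checked by enumeration. The Python sketch below lists, for each of the five relations, the pairs (a, b) that it excludes. It reads (2.40) as $a - b \ge 0$ and writes every pair in the order (a, b); under that reading the pair excluded by (2.40) is (0, 1), and renaming a and b turns it into (1, 0). The dictionary keys and variable names are illustrative.

```python
from itertools import product

# The five binary relations of Theorem 2.2, as predicates on (a, b).
RELATIONS = {
    "(2.39) a + b >= 1": lambda a, b: a + b >= 1,
    "(2.40) a - b >= 0": lambda a, b: a - b >= 0,
    "(2.41) a + b <= 1": lambda a, b: a + b <= 1,
    "(2.42) a + b == 1": lambda a, b: a + b == 1,
    "(2.43) a - b == 0": lambda a, b: a - b == 0,
}

PAIRS = list(product((0, 1), repeat=2))   # (0,0), (0,1), (1,0), (1,1)

# For each relation, collect the pairs it outlaws.
outlawed = {name: [pair for pair in PAIRS if not holds(*pair)]
            for name, holds in RELATIONS.items()}

for name, pairs in outlawed.items():
    print(name, "outlaws", pairs)
```

No relation in the list outlaws a set of pairs that pins a or b to a single value, in line with the proof's treatment of those cases as contradictions.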

If the policy constraint involves more than two alternatives
interacting with each other, we resort to the algebra of events to obtain a
logical (or Boolean) expression for the constraint and then transform it
into an algebraic constraint. An example illustrates this. Suppose we
have three alternatives, the selection of each being denoted by the
events A, B, and C, respectively (each alternative being, of course,
in a different state). Not selecting an alternative will be denoted by
the complement, e.g., A'. The values of the $d_i^k$ will be denoted by
a, b, c.

Assume that the constraint is that A and B cannot occur
simultaneously unless C also occurs. This means that ABC' is outlawed.
Hence, the Boolean expression that has to be true is

$$(ABC')' = A' + B' + C \tag{2.44}$$

If the values of a, b, and c are to represent the events A, B,
and C, then the values representing A', B', and C' are (1-a),
(1-b), and (1-c), respectively (since a, b, c can only be zero for
nonselection and unity for selection). The Boolean expression (2.44) is
false only if all of its components are false (i.e., of value zero).
Thus, algebraically we want the corresponding values to sum to something
other than zero. This means

$$(1 - a) + (1 - b) + c \ge 1$$

$$-a - b + c \ge -1$$

$$a + b - c \le 1 \tag{2.45}$$
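The equivalence between the Boolean condition (2.44) and the algebraic constraint (2.45) can be confirmed over all eight 0/1 assignments. The function names in this short Python sketch are illustrative.

```python
from itertools import product

def boolean_constraint(A, B, C):
    """(ABC')' = A' + B' + C, Equation (2.44)."""
    return (not A) or (not B) or C

def algebraic_constraint(a, b, c):
    """a + b - c <= 1, Equation (2.45)."""
    return a + b - c <= 1

assignments = list(product((0, 1), repeat=3))
agree = all(boolean_constraint(bool(a), bool(b), bool(c))
            == algebraic_constraint(a, b, c)
            for a, b, c in assignments)
violations = [t for t in assignments if not algebraic_constraint(*t)]

print("translations agree:", agree)
print("assignments ruled out by (2.45):", violations)
```

The only assignment ruled out is the one in which A and B are selected and C is not, which is the combination ABC' the constraint was meant to outlaw.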



Two things have to be noted here. First, if the reduction of the
Boolean expression to its minimal sum involves intersections of events,
then the algebraic constraint corresponding to it will involve products.
Secondly, (2.45) was derived by requiring the L.H.S. representing (2.44)
to be greater than or equal to one. This is because truth of any component
is sufficient to establish the truth of the whole expression, whence the
truth of more than one component causes the sum to exceed unity. This
does not hold, however, if the Boolean expression is an exclusive OR.
There, only one component can be true, whence the sum of values can never
exceed unity. In this case, the equivalent of (2.45) is derived by
setting the sum equal to unity. Now, we formally derive the foregoing.

We are interested in "translating" a Boolean expression representing
combinations of events into an algebraic expression. By "translation,"
we specifically mean that we are seeking an algebraic expression which
holds if, and only if, the corresponding Boolean expression is true. To
this end, we start by defining algebraic variables to correspond with the
events. Since an event X has only two possible values (true and false,
representing the event's occurrence or lack of it), we define an
associated algebraic variable x which can only take on the values 1 and 0.
Thus:


Definition . 

Let  X be  an  event.  Its  associated  algebraic  variable  is  a real 
number  x restricted  to  the  values  1 and  0 such  that  X is  true  if  and 
only  if  x = 1. 

Hereafter,  we  will  denote  the  values  true  and  false  by  T and  F. 


Proposition 2.1.

If X is an event whose associated algebraic variable is x, then
the algebraic variable associated with X', the complement of X, is
1 - x.


Proof.

Let Y = X' and y = 1 - x.

Assume Y = T.

Then $X = F \Rightarrow x \neq 1 \Rightarrow x = 0 \Rightarrow y = 1 - x = 1$.

Hence $Y = T \Rightarrow y = 1$.

Assume y = 1.

Then $x = 1 - y = 0 \neq 1 \Rightarrow X = F \Rightarrow Y = T$,

i.e., $y = 1 \Rightarrow Y = T$.

So, we have proved that $Y = T \iff y = 1$, which is the definition of the
associated algebraic variable.

The importance of Proposition 2.1 is that, whenever we have the
complement of an event, we can substitute the "algebraic complement" of
its associated algebraic variable. In other words, if whenever we
encounter X' we set Y = X' in the Boolean expression, then the
resulting y in the algebraic expression can be set to 1-x. This enables
us to only consider uncomplemented events, without loss of generality.
What follows applies in general if the mentioned substitutions are made.

Now  we  consider  a Boolean  expression  composed  of  the  sums  (OR's) 
of  products  (AND's).  First,  we  consider  products. 

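Proposition 2.2.

$B = ABC \cdots Z$ is true $\iff abc \cdots z = 1$.


Proof.

$ABC \cdots Z = T \iff A = B = C = \cdots = Z = T$ (from the rules of Boolean algebra)
$\iff a = b = c = \cdots = z = 1$ (from the definition)
$\iff abc \cdots z = 1$ (since each variable is either 0 or 1).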


Proposition 2.3.

$B = B_1 + B_2 + \cdots + B_N$ is true $\iff b_1 + b_2 + \cdots + b_N \ge 1$,

where $B_i$ is a Boolean product of events and $b_i$ is the corresponding
product of the associated algebraic variables.

Proof.

From the rules of Boolean algebra, we have

$$B = T \iff \exists\, i \ni B_i = T \iff b_i = 1.$$

If only one such i exists, then the $b_i$ sum to unity; otherwise,
their sum exceeds unity.
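Propositions 2.1 through 2.3 together describe a mechanical translation of any sum of products. The Python sketch below implements that translation and checks, by exhaustive enumeration, that the Boolean expression is true exactly when the translated sum is at least one. The data representation (a literal is a pair of event name and complement flag, a term is a list of literals, an expression is a list of terms) and all function names are assumptions made for this illustration.

```python
from itertools import product
from math import prod

def boolean_value(expression, truth):
    """Evaluate a sum (OR) of products (AND) of possibly complemented events."""
    return any(all(truth[name] != complemented for name, complemented in term)
               for term in expression)

def algebraic_value(expression, truth):
    """Sum over terms of the product of x (event) or 1 - x (complement)."""
    x = {name: int(value) for name, value in truth.items()}
    return sum(prod((1 - x[name]) if complemented else x[name]
                    for name, complemented in term)
               for term in expression)

def translation_is_valid(expression, names):
    """Check that 'expression is true' <=> 'algebraic value >= 1'
    for every assignment of truth values to the named events."""
    for bits in product((False, True), repeat=len(names)):
        truth = dict(zip(names, bits))
        if boolean_value(expression, truth) != (algebraic_value(expression, truth) >= 1):
            return False
    return True

# A' + B' + C, the expression of (2.44).
expr_244 = [[("A", True)], [("B", True)], [("C", False)]]
# AB + A'B': both alternatives selected, or neither.
both_or_neither = [[("A", False), ("B", False)], [("A", True), ("B", True)]]

print(translation_is_valid(expr_244, ["A", "B", "C"]))
print(translation_is_valid(both_or_neither, ["A", "B"]))
```

When several terms are true at once, as with A' + B' + C under A = B = F and C = T, the algebraic value exceeds one, which is why the general rule uses an inequality.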

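Corollary.

Let $B = B_1 + B_2 + \cdots + B_N$, where $B_i$ is the Boolean product of
events. Let

$$I = \{1, 2, \ldots, N\}.$$

If $B_i B_j = F$ for all $i, j \in I$ with $i \neq j$, then B is true $\iff b_1 + b_2 + \cdots + b_N = 1$.

Proof.

Suppose $B_i = T$ and $B_j = T$ for some $i \neq j$. Then $B_i B_j = T$.
But this contradicts $B_i B_j = F$, $\forall\, i, j$, $i \neq j$. Hence, there cannot exist more
than one $i \ni B_i = T$. In this case,

$$B = T \iff \exists\, i \ni B_i = T \text{ and } B_j = F \text{ for } j \neq i$$

$$\iff \exists\, i \ni b_i = 1 \text{ and } b_j = 0 \text{ for } j \neq i$$

$$\iff \sum_{i=1}^{N} b_i = 1.$$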


The preceding is the case of an exclusive OR. A special case of
this is when only one of N events is allowed to be true. This can be
detected by the special form the Boolean expression takes. It is formed
of the sum of N products, where each product is formed of an event
and the complements of the remaining N-1 events. For example, for
three events A, B, C:

AB'C' + A'BC' + A'B'C

In this case, a + b + c = 1. This is because the only way the Boolean
expression can be true is for only one event to be true, and the rest
false. This happens if, and only if, one associated algebraic variable
is unity and the rest are zero. For instance, (2.43) may be derived in
this fashion. Here, the two alternatives A and B are either both
selected (1,1) or both not selected (0,0). The Boolean expression for
this is

AB + A'B'

which is an exclusive OR. Moreover, it is of the special form we just
illustrated. Putting C = B', we get

AC' + A'C

Then, the algebraic expression is

$$a + c = 1.$$

Since

$$C = B' \Rightarrow c = 1 - b,$$

we get

$$a + 1 - b = 1,$$

i.e.,

$$a - b = 0.$$


Note that the "translation" is not unique. The general procedure
outlined in the proposition, however, always yields a valid "translation."
For instance, an alternative form of (2.43) could be derived by applying
the general procedure to the Boolean expression

AB + A'B'

We would get

$$ab + (1 - a)(1 - b) \ge 1,$$

i.e.,

$$2ab - a - b \ge 0.$$

For a and b restricted to 0 and 1, this inequality defines exactly
the same constraint as (2.43) (substituting the four possible values
verifies this). If we had noted that the Boolean expression is an
exclusive OR, we would have obtained

$$2ab - a - b = 0,$$

i.e.,

$$2ab - a^2 - b^2 = 0,$$

because

$$a^2 = a, \qquad b^2 = b.$$

Thus,

$$(a - b)^2 = 0$$

$$a - b = 0$$

which, again, is equivalent to (2.43).

Because of the nonuniqueness of "translation," we suggest that the
general procedure be used only as a last resort, in order that we get
the simplest possible constraints. If the constraint is too
complicated to translate intuitively, it is expressed via the algebra of events.
A check is made to see if it is of the special form mentioned previously.
If so, the $d_i^k$'s are summed to unity. Otherwise, the corresponding $d_i^k$
is substituted for its event and $1 - d_i^k$ for the complement of the event.
If the Boolean expression is an exclusive OR, the resultant algebraic
expression is equated to unity; otherwise, it is set greater than or
equal to unity. In this manner, any policy constraint can be translated
into an algebraic constraint under the assumption that the $d_i^k$ involved
in the constraints are all either zero or unity.

Note that the general procedure for translating Boolean to
algebraic expressions only applies to sums of products.


The foregoing then implies that the Markov decision process with
policy constraints can be formulated as (2.34) through (2.37) plus some
extra constraints on a subset of the $d_i^k$.

Now, we proceed to prove an important result.

Proposition 2.4.

For the problem

$$\max \sum_{i=1}^{N} \pi_i \sum_{k=1}^{K_i} q_i^k\, d_i^k \tag{2.34}$$

subject to

$$\pi_j - \sum_{i=1}^{N} \pi_i \sum_{k=1}^{K_i} p_{ij}^k\, d_i^k = 0, \qquad j = 1, 2, \ldots, N, \tag{2.35}$$

$$\sum_{i=1}^{N} \pi_i \sum_{k=1}^{K_i} d_i^k = 1, \tag{2.36}$$

$$\sum_{k=1}^{K_i} d_i^k = 1, \qquad d_i^k \ge 0, \qquad i = 1, 2, \ldots, N, \tag{2.37}$$

$$\phi_\ell(d_i^k) \le 0, \quad \ell = 1, 2, \ldots, m, \quad (i,k) \in S_1; \qquad \psi_p(d_i^k) = 0, \quad p = 1, 2, \ldots, q, \quad (i,k) \in S_2, \tag{2.46}$$

where $S_1$ and $S_2$ are subsets of $S = \{(i,k)\}$, the following holds.

If $d_i^k$ is assumed to only take on values of zero or unity for
$(i,k) \in S_1 \cup S_2$, then $d_i^k$ will only take on values of zero or unity
for $(i,k) \in S$.


Proof.

Since the set S of all possible alternatives is finite, all of its
subsets are finite. Moreover, restricting $d_i^k$ to values of zero or unity
makes the possible number of ways in which (2.46) can be satisfied
finite. In other words, we have a finite number of values for the (m + q)-tuples
representing the values of $d_i^k$ for $(i,k) \in S_1 \cup S_2$; of those
values, only a limited number will satisfy (2.46) unless the constraints
are contradictory.

Now, for each (m+q)-tuple satisfying (2.46), the $d_i^k$'s involved
have certain given values (zero or unity). Substituting those values in
(2.34) through (2.37) yields an identical problem with, possibly, fewer
constraints. The structure of the problem, however, is the same. Thus,
it is our LP problem (2.23) through (2.25) for which we proved that the
$d_i^k$'s are all zero or unity. Hence, our problem can be reduced to a
finite number of LP's, each defined on a subset of the set defined by the
constraints (2.35) through (2.37), and consequently yielding values of
zero or unity for the $d_i^k$'s not involved in the constraints (2.46).

The foregoing implies that we can formulate the constrained policy
problem as a finite number of LP's corresponding to the number of ways
(2.46) can be satisfied. We could then solve each problem (either as an
LP or, even better, using the VD-PI algorithm), and the optimum policy
would be that belonging to the problem yielding the highest gain. This,
of course, is unacceptable from a computational point of view. Not only
is the sub-problem of determining how many LP's we have a combinatorial
one, but also the number of LP's we would have to solve could be
astronomical. That is why we proceed to use the Lagrange multiplier method
to reduce our constrained problem to two unconstrained ones.
Lagrange Multiplier Formulation


The Lagrange multiplier rule for constrained maximization problems
provides us with a powerful technique for reducing the constrained
problem to a number of unconstrained ones.

One form of the Lagrange multiplier rule is the following [8]:
Let $x^* \in E^n$ maximize $f(x)$ subject to

$$C_i(x) = 0, \qquad i = 1,2,\ldots,m$$

Then there exist real numbers $\lambda_i^*$, $i = 1,2,\ldots,m$, such that
the point $(x_1^*, x_2^*, \ldots, x_n^*, \lambda_1^*, \ldots, \lambda_m^*) = (x^*,\lambda^*) \in E^{n+m}$
is a critical point of the function

$$L(x,\lambda) = f(x) + \sum_{i=1}^{m} \lambda_i C_i(x) \qquad (2.47)$$

that is,

$$\nabla L(x^*,\lambda^*) = 0$$

Moreover,

$$L(x^*,\lambda^*) = f(x^*)$$


The appealing feature of this form of the Lagrange multiplier rule is
that it transforms the constrained maximization of the function f into
finding critical points of the "Lagrangian" L, as defined by (2.47),
which is unconstrained. Of course, the cost incurred here is the
increase of dimensionality from n to n+m. However, the Lagrangian
offers the possibility of an iterative algorithm. A set of $\lambda$'s is chosen
for a point x; then x and $\lambda$ are changed successively, until we get
to the critical point. The fact that at the solution the values of the
original function and the Lagrangian are equal, plus the fact that we
are trying to maximize such a value, can be used to move from one set
of variables to a new one. We make the increase of L our objective.
Actually, we could divide the variables $(x,\lambda)$ whichever way we choose,
alternately changing each set until we arrive at the solution.
However, the convergence of such an iterative algorithm to the desired
maximum has to be proved. For one thing, the proposition does not state
that any critical point of L is, necessarily, a constrained maximum of
f. Only the converse is guaranteed. Moreover, nothing in the
proposition guarantees convergence even to a critical point of L. It merely
establishes the existence of such a point if the function f has a
constrained maximum. With this in mind, we proceed to apply the Lagrange
multiplier rule to solving the Markov decision process.
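
As a small illustration of the rule (a worked example that is not part of the source), take n = 2 and m = 1, with the assumed objective $f(x) = x_1 x_2$ and the assumed single constraint $C_1(x) = x_1 + x_2 - 2 = 0$:

```latex
L(x,\lambda) = x_1 x_2 + \lambda\,(x_1 + x_2 - 2)

\frac{\partial L}{\partial x_1} = x_2 + \lambda = 0, \qquad
\frac{\partial L}{\partial x_2} = x_1 + \lambda = 0, \qquad
\frac{\partial L}{\partial \lambda} = x_1 + x_2 - 2 = 0

\Rightarrow\quad x_1^* = x_2^* = 1, \qquad \lambda^* = -1, \qquad
L(x^*,\lambda^*) = 1 = f(x^*)
```

On the constraint, $f = x_1(2 - x_1)$ is indeed maximized at $x_1 = 1$. The critical point is nevertheless a saddle point of L rather than a maximum: along $x_1 = x_2 = s$ with $\lambda = -1$, $L = s^2 - 2s + 2$ has a minimum at $s = 1$. This is why the rule speaks of critical points of L and why convergence of an iterative scheme has to be established separately.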

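First, we rewrite the form of the general problem (including
policy constraints):

$$\max \sum_{i=1}^{N} \pi_i \sum_{k=1}^{K_i} q_i^k d_i^k \qquad (2.34)$$

subject to

$$\sum_{i=1}^{N} \pi_i \sum_{k=1}^{K_i} p_{ij}^k d_i^k - \pi_j = 0, \qquad j = 1,2,\ldots,N \qquad (2.35)$$

$$\sum_{i=1}^{N} \pi_i \sum_{k=1}^{K_i} d_i^k = 1 \qquad (2.36)$$

$$\sum_{k=1}^{K_i} d_i^k = 1, \qquad i = 1,2,\ldots,N \qquad (2.37)$$

and to the policy constraints (2.46), which are indexed by $l = 1,2,\ldots,m$
for $(i,k) \in S_1$ and by $p = 1,2,\ldots,q$ for $(i,k) \in S_2$,

where $S_1$ and $S_2$ are subsets of the set $S = \{(i,k)\}$ of (i,k) pairs
defining each alternative in each state.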

We will treat the problem as follows. Since we have already
shown that (2.46) merely restricts us to subsets of S, and that
setting $d_i^k = 0$ or 1 for $(i,k) \in S_1 \cup S_2$ results in the rest of the $d_i^k$
being unity or zero, we will consider (2.34) through (2.37) and then, when
we change the $d_i^k$, we will take (2.46) into consideration and only
choose zero or unity for those $d_i^k$ involved in (2.46).

Corresponding to (2.34) through (2.37), we have a Lagrangian
L involving 2N + 1 multipliers $\lambda$. We will name them according to the
constraints, whence they will later acquire significance. For (2.35),
we define $v_1, \ldots, v_N$ (i.e., $\lambda_1, \ldots, \lambda_N$). For (2.36), we define the
multiplier g (i.e., $\lambda_{N+1}$). For (2.37), our multipliers
($\lambda_{N+2}, \ldots, \lambda_{2N+1}$) will be named $\beta_1, \ldots, \beta_N$.

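Hence, our Lagrangian is

$$L = \sum_{i=1}^{N} \sum_{k=1}^{K_i} \pi_i q_i^k d_i^k
  + \sum_{j=1}^{N} v_j \left[ \sum_{i=1}^{N} \pi_i \sum_{k=1}^{K_i} p_{ij}^k d_i^k - \pi_j \right]
  + g \left[ 1 - \sum_{i=1}^{N} \pi_i \sum_{k=1}^{K_i} d_i^k \right]
  + \sum_{i=1}^{N} \beta_i \left[ 1 - \sum_{k=1}^{K_i} d_i^k \right] \qquad (2.48)$$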
It is a function of the limiting state probabilities $\pi_i$, the
conditional probabilities $d_i^k$ of selecting alternative k given state
i, and the Lagrange multipliers $v_i$, g, and $\beta_i$.

All in all, we have $3N + \sum_{i=1}^{N} K_i + 1$ variables (the dimension
of the Euclidean space over which L is defined). Our objective is to
find a critical point for L. This we do iteratively. Starting with
$d_i^k$'s that satisfy (2.37), we set the partial derivatives of L with
respect to the $\pi$'s equal to zero. This gives values for the v's and
g (and also, as we will show, for the $\pi$'s; this is equivalent to
setting the partial derivatives w.r.t. the v's and g to zero). Then, we
use the LP result which tells us that we know beforehand that the $d_i^k$
are zero or unity. The way we use it is to change the $d_i^k$ in the
zero-unity subspace, rather than set partial derivatives w.r.t. $d_i^k$ equal to
zero. The partial derivative of L w.r.t. $\pi_i$ is

$$\frac{\partial L}{\partial \pi_i} = \sum_{k=1}^{K_i} q_i^k d_i^k - v_i
  + \sum_{j=1}^{N} v_j \sum_{k=1}^{K_i} p_{ij}^k d_i^k - g \sum_{k=1}^{K_i} d_i^k,
  \qquad i = 1,2,\ldots,N \qquad (2.49)$$


Given a policy P, i.e., values of $d_i^k$ satisfying (2.37) (and
for constrained policies, (2.46) also), we get only one k for each i,
and (2.49) reduces to

$$\frac{\partial L}{\partial \pi_i} = q_i(P) - v_i + \sum_{j=1}^{N} p_{ij}(P)\, v_j - g,
  \qquad i = 1,2,\ldots,N \qquad (2.50)$$


Setting the derivatives equal to zero gives us

$$q_i + \sum_{j=1}^{N} p_{ij} v_j = g + v_i, \qquad i = 1,2,\ldots,N \qquad (2.51)$$


but (2.51) is the VD phase of the VD-PI algorithm. Thus, that phase is
one in which Lagrange multipliers are updated. In (2.51), we have N
equations in N + 1 unknowns. However, since (2.35) contains a
redundant equation, any one of (2.35) can be discarded, which is equivalent
to setting one of the v's to zero, whence (2.51) becomes a nonsingular
system of equations. In matrix form, it can be written as


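$$\begin{bmatrix}
1 - p_{11} & -p_{12} & \cdots & -p_{1N} & 1 \\
-p_{21} & 1 - p_{22} & \cdots & -p_{2N} & 1 \\
\vdots & \vdots & & \vdots & \vdots \\
-p_{N-1,1} & -p_{N-1,2} & \cdots & -p_{N-1,N} & 1 \\
-p_{N1} & -p_{N2} & \cdots & 1 - p_{NN} & 1
\end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_N \\ g \end{bmatrix}
=
\begin{bmatrix} q_1 \\ q_2 \\ \vdots \\ q_{N-1} \\ q_N \end{bmatrix}$$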

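Regarding the multiplication of a matrix by a vector as taking
a linear combination of the matrix columns, the above states that we are
attempting to form the q vector as a linear combination of the N + 1
columns of the matrix on the L.H.S. The coefficients of the combination are the
v's and g.

Since $v_N$ is set to zero, however, this means that we can drop
the N-th column to get our N × N system of equations:

$$\begin{bmatrix}
1 - p_{11} & -p_{12} & \cdots & -p_{1,N-1} & 1 \\
-p_{21} & 1 - p_{22} & \cdots & -p_{2,N-1} & 1 \\
\vdots & \vdots & & \vdots & \vdots \\
-p_{N1} & -p_{N2} & \cdots & -p_{N,N-1} & 1
\end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_{N-1} \\ g \end{bmatrix}
=
\begin{bmatrix} q_1 \\ q_2 \\ \vdots \\ q_N \end{bmatrix}$$

Denoting the N × N matrix on the L.H.S. by P, we get

$$\begin{bmatrix} v_1 & v_2 & \cdots & v_{N-1} & g \end{bmatrix}^T = P^{-1} q \qquad (2.52)$$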


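Now we proceed to show that the $\pi$'s can be obtained as a
by-product of VD (which has been compactly stated by (2.52)). If we
differentiate the Lagrangian w.r.t. the v's and g, we get (2.35) and (2.36),
i.e.,

$$\frac{\partial L}{\partial v_j} = \sum_{i=1}^{N} \pi_i \sum_{k=1}^{K_i} p_{ij}^k d_i^k - \pi_j, \qquad j = 1,2,\ldots,N$$

$$\frac{\partial L}{\partial g} = 1 - \sum_{i=1}^{N} \pi_i \sum_{k=1}^{K_i} d_i^k$$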


For a given policy P, if we set the partial derivatives equal
to zero, we get our original definition of limiting state probabilities,
namely,

$$\pi_j - \sum_{i=1}^{N} p_{ij} \pi_i = 0, \qquad j = 1,2,\ldots,N$$

$$\sum_{i=1}^{N} \pi_i = 1$$


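Since the first N equations contain a redundant one, we can
drop the Nth equation and get N equations in the N unknown $\pi_i$. In
matrix form, this may be written as

$$\begin{bmatrix}
1 - p_{11} & -p_{21} & \cdots & -p_{N1} \\
-p_{12} & 1 - p_{22} & \cdots & -p_{N2} \\
\vdots & \vdots & & \vdots \\
-p_{1,N-1} & -p_{2,N-1} & \cdots & -p_{N,N-1} \\
1 & 1 & \cdots & 1
\end{bmatrix}
\begin{bmatrix} \pi_1 \\ \pi_2 \\ \vdots \\ \pi_{N-1} \\ \pi_N \end{bmatrix}
=
\begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ 1 \end{bmatrix}$$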


But the matrix on the L.H.S. is the transpose of P. Hence,
the $\pi$'s are the solution of

$$P^T \pi = \begin{bmatrix} 0 & \cdots & 0 & 1 \end{bmatrix}^T$$

Hence, the $\pi$'s are the last column of $(P^{-1})^T$, i.e., the
transpose of the last row of $P^{-1}$. Since we compute $P^{-1}$ in the course
of VD, this means that we actually have the $\pi$'s without any additional
computational effort whatsoever. The significance of this will become
apparent when we consider policy constraints. Hence, setting the partial
derivatives of L w.r.t. the v's and g equal to zero actually means
setting all its partial derivatives (except for those involving the $d_i^k$'s
and their $\beta$'s) equal to zero.
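
The VD step and its by-product can be sketched as follows. This is a minimal illustration, not the author's implementation: the function names `solve` and `value_determination` and the two-state data are assumptions made for the example, and for brevity the sketch solves two linear systems instead of reusing a single inverse as the text suggests.

```python
from fractions import Fraction


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with exact fractions."""
    n = len(A)
    M = [[Fraction(x) for x in row] + [Fraction(bi)] for row, bi in zip(A, b)]
    for col in range(n):
        # Find a row with a nonzero pivot and move it into place.
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [x / M[col][col] for x in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [row[n] for row in M]


def value_determination(p, q):
    """Return (v, g, pi) for one fixed policy with transition rows p and rewards q."""
    n = len(p)
    # N x N matrix: first N-1 columns of (I - p), last column all ones (v_N = 0).
    A = [[(1 if i == j else 0) - p[i][j] for j in range(n - 1)] + [1]
         for i in range(n)]
    x = solve(A, q)                      # unknowns (v_1, ..., v_{N-1}, g)
    v, g = x[:-1] + [Fraction(0)], x[-1]
    # pi solves A^T pi = (0, ..., 0, 1)^T, i.e. pi is the last row of A^{-1}.
    At = [[A[i][j] for i in range(n)] for j in range(n)]
    pi = solve(At, [0] * (n - 1) + [1])
    return v, g, pi


# Illustrative two-state chain (numbers assumed for the example).
p = [[Fraction(1, 2), Fraction(1, 2)], [Fraction(2, 5), Fraction(3, 5)]]
q = [6, -3]
v, g, pi = value_determination(p, q)
```

For this chain the sketch gives v = (10, 0), g = 1 and pi = (4/9, 5/9), and the gain equals the inner product of pi and q.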

Now, we proceed to interpret the Lagrange multipliers. (2.53)
and (2.48) imply that the v's are involved in L in the form

$$\sum_{j=1}^{N} v_j \frac{\partial L}{\partial v_j}$$

where $\partial L / \partial v_j$ gives the amount by which the j-th constraint on $\pi$ is
violated. It is necessary that it be zero for all j. Otherwise, we do
not have a critical point for L, whence it can never be a constrained
maximum. Hence, $v_j$ gives us the cost of violating the constraint
by one unit. But that constraint represents the equilibrium of
probabilistic flows in the steady state. Thus, $v_j$ is the value of being in
state j (albeit a relative one). If, in that state, the equilibrium
of probabilistic flows does not hold (e.g., by virtue of using $p_{ij}$'s
belonging to a policy different from the one the $\pi$'s were computed
for), $v_j$ gives us the cost per unit of "disequilibrium" for that state.
It might be that we gain, in terms of the value of L, by doing that,
i.e., we increase L. However, there is no guarantee that when we
compute the new $\pi$'s and v's we will get a net increase in L. This will
be explained shortly.


The interpretation of g is straightforward. From (2.52), g
is the inner product of the last row in $P^{-1}$ and the q vector. By
virtue of (2.55), that row is merely the transpose of the $\pi$ vector.
Hence,

$$g = \pi^T q = \sum_{i=1}^{N} \pi_i q_i$$

But this is the value of our original function we are trying to maximize.
Hence, for a given policy, the value of the gain is one of the Lagrange
multipliers that make the aforementioned partial derivatives equal to
zero.

The VD, therefore, results in equating some of the partial
derivatives of L to zero for a given policy. The remainder of those
partial derivatives are not necessarily zero though. We would like to
equate them to zero. Rather than do that, however, we can do better.
We already know that $d_i^k$'s have to be zero or unity for unconstrained
policies (and for constrained policies if we set the extra ones to zero
or unity). Hence, it would be more efficient to look upon this part of
the maximization process as a discrete problem and try to increase the
value of L (see [1] for example). To do this, we rewrite (2.48),
rearranging the terms, so as to bring out the dependence on $d_i^k$:

$$L = \sum_{i=1}^{N} \pi_i \sum_{k=1}^{K_i} \left( q_i^k + \sum_{j=1}^{N} p_{ij}^k v_j \right) d_i^k
  - \sum_{j=1}^{N} v_j \pi_j
  + g \left[ 1 - \sum_{i=1}^{N} \pi_i \sum_{k=1}^{K_i} d_i^k \right]
  + \sum_{i=1}^{N} \beta_i \left[ 1 - \sum_{k=1}^{K_i} d_i^k \right]$$


Our aim is to select the $d_i^k$ such that we get the largest
increase possible in L. Since we will be selecting them as zero or
unity, according to our prior knowledge for unconstrained policies (and
our forcing them for constrained ones), and since the $\pi_i$ are held
constant during improvement of the policy, we need only concern ourselves
with the first term in L. The second one does not involve $d_i^k$, and
the last two vanish. Thus, we concentrate on maximizing the quantity:

$$\sum_{i=1}^{N} \pi_i \sum_{k=1}^{K_i} \left( q_i^k + \sum_{j=1}^{N} p_{ij}^k v_j \right) d_i^k \qquad (2.56)$$


(2.56) is the inner product of two N-component vectors. The first is
the vector $\pi(P)$, as determined by the current policy P. Its
components are nonnegative. The second vector is a variable. For each
policy P', where P'(i) = k', the $d_i^k$ satisfy (2.37), whence the
summation over k reduces to one term for each state, namely
$t_i^{k'} = q_i^{k'} + \sum_{j=1}^{N} p_{ij}^{k'} v_j$, the i-th component of the vector selected by P'. We will
denote that vector by T(P'). Hence, the maximization of (2.56) reduces
to selecting that vector T(P'), i.e., that policy P', which yields
the largest value of inner product with the constant vector $\pi(P)$. This
is, essentially, a combinatorial problem. The absence of policy
constraints reduces it to a much simpler problem. In the absence of such
constraints, all possible vectors T(P') are allowable (i.e., feasible).
There are $\prod_{i=1}^{N} K_i$ such vectors. Among them is that vector for which

$$t_i^* = \max_{k=1,2,\ldots,K_i} \left[ q_i^k + \sum_{j=1}^{N} p_{ij}^k v_j \right],
  \qquad i = 1,2,\ldots,N \qquad (2.57)$$


The maximization of (2.57) yields, for each i, an alternative
$k_i^*$. Those alternatives make up a policy P* with the corresponding
vector T(P*). Now consider the inner product of T(P*) and $\pi(P)$. Since
each component of T(P*) is at least as large as the corresponding component
of every other T(P') and, since the components of $\pi$ are nonnegative,
it immediately follows that P* is the policy that maximizes (2.56).
Hence, for unconstrained policies, the combinatorial problem of
maximizing (2.56) reduces to the N discrete, "uncoupled," maximization
problems (2.57). But (2.57) is the PI phase of the VD-PI algorithm.
Thus, that phase is actually the maximization of L with respect to
the $d_i^k$, using the values of $\pi$ and v for the current policy. This
can be represented, schematically, as in Figure 2.1. There, we grouped
the variables into three axes. The P's are represented by an axis,
as are the v's and $\pi$'s. Since the v's do not exist in the original
constrained problem (2.34) through (2.37), its set of feasible points
lies in the ($\pi$,P) plane. Moreover, since the constraints yield unique
values for $\pi$, the problem reduces to selecting from a discrete set of
points $A_i$ in that plane. Regarding the Lagrangian, for each policy
P, it is a function of $\pi$ and v. However, the VD results in points
whose $\pi$ is identical to that of the given policy, whence the points
$a_i$ have the same $\pi$ component as the points $A_i$. As shown in Figure
2.1, the PI consists of moving from $a_1$, say, along the P direction,
holding $\pi$ and v constant, to maximize L. This results in point
$b_2$, say. The VD then takes us to $a_2$, from which we go into the PI


A word of caution is necessary here. It pertains to setting
up the increase in L as a criterion for selecting a policy having a
higher gain than the current one. Firstly, the fact that L is
maximized along the "ray" emanating from a point $a_1$ does not guarantee
that the resultant $a_2$ will give the highest gain possible on this
iteration (i.e., the best improvement).

Referring to Figure 2.2, the PI surveys the points $b_2$, $b_3$, and so on,
which lie on the "ray" emanating from $a_1$. It then selects
that point $b_i$ at which L is maximum. For Figure 2.1, $b_2$ happens
to be that point. The VD then gives $a_2$. However, it cannot be proved
that another point $b_3$, say, at which L is less than at $b_2$, will
necessarily yield a point $a_3$ for which the gain is less than at $a_2$. In
other words, the ordering of the values of L at the $b_i$ is not necessarily
identical to the ordering of the gain values at the corresponding $a_i$.
Moreover, merely increasing L along the $a_1$ "ray" does not guarantee
a better policy. That guarantee is to be provided outside the Lagrangian
framework, as we shall explain later. The question might arise,
therefore, of whether it is at all appropriate to maximize L for policy
improvement. The appropriateness of this procedure is justified for two
reasons. First, we know that at the optimum the value of the constrained
function we are trying to maximize is identical to that of L, as well
as at all points $a_i$. Second, since we are trying to maximize our
original function, it would pay to try increasing L as we go along. This
gives us an insight into how to go about policy improvement when there
are policy constraints, as long as we bear in mind two things. The first
is that the sole purpose of PI is to obtain a policy having a higher
gain, irrespective of how much higher it is (e.g., it might happen that
applying (2.57) to only one state yields a policy whose gain is
higher than the policy obtained by applying (2.57) to all states).
The computational cost of trying to detect this is prohibitive.
Consequently, we apply (2.57) to at least one state, the result being a
policy having a higher gain than the current one, i.e., merely an
improvement (not necessarily the best improvement). The second thing to bear
in mind is that merely increasing L from $a_i$ to $b_j$, say, does not
guarantee that $a_j$ is a better policy. It might very well happen that
L increases from $a_i$ to $b_j$ and then decreases from $b_j$ to $a_j$ (where
its value is g) such that $L(a_j) < L(a_i)$.
Now we reconsider (2.56), namely,

$$\sum_{i=1}^{N} \pi_i \sum_{k=1}^{K_i} \left( q_i^k + \sum_{j=1}^{N} p_{ij}^k v_j \right) d_i^k \qquad (2.56)$$


For unconstrained policies, it reduces to (2.57), which gave
us a policy P*. If the policy constraints do not make P* infeasible,
then we select P*. Otherwise, we have to solve the combinatorial
problem of selecting that vector T(P) from the feasible set of such vectors
(corresponding to the feasible set of policies), which maximizes (2.56)
and guarantees an increase in the gain. If we rewrite (2.56) as

$$\sum_{i=1}^{N} \sum_{k=1}^{K_i} \pi_i \left( q_i^k + \sum_{j=1}^{N} p_{ij}^k v_j \right) d_i^k$$

and define

$$t_i^k = q_i^k + \sum_{j=1}^{N} p_{ij}^k v_j$$

we find that we are trying to maximize

$$\sum_{i=1}^{N} \sum_{k=1}^{K_i} \pi_i t_i^k d_i^k \qquad (2.58)$$


Since $d_i^k$ is zero for all k's but one in any state i, and
the $t_i^k$ are the PI test quantities of yesteryear, (2.58) tells us that
we are trying to maximize the sum of those test quantities, one in each
state, weighted by how much time the system spends, on the average, in
each state. This makes intuitive sense. Moreover, maximizing the
individual components of a sum automatically maximizes the whole sum. In
the absence of policy constraints, it is possible, as we have shown, to
maximize the individual components. We do not have to worry about
feasibility. In this case, the $\pi_i$ become mere nonnegative scaling factors of
the components we are trying to maximize, whence they can be ignored.
Once we introduce constraints on the policies, however, we have to take
feasibility into consideration. The states become "coupled" through the
constraints, such that it might not be feasible to maximize the
individual components independently of one another. It is the "noncoupling" of
states which allows the reduction of (2.58) to (2.57). Thus, when policy
constraints are present, we have to consider the sum as a whole and seek
an efficient method of solving the nonreducible combinatorial problem.
Here, we will have to take into consideration the $\pi$ of the current
policy. This would seem to imply additional computations (solving N
simultaneous linear equations) per iteration. However, this is not so. We
have shown that the $\pi$'s are already there as a by-product of VD ((2.52)
and (2.55)). Thus, taking the $\pi$'s into consideration does not involve
any extra computational effort. The test quantities are merely
multiplied by the corresponding $\pi$'s before we embark upon our
maximization. That maximization, as we noted earlier, does not, of itself,
guarantee an improvement in the gain. However, we do have a sufficient
condition (developed by Howard [4]) for a gain improvement from one
policy to another.
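
For the unconstrained case, the test quantities and the state-by-state maximization (2.57) can be sketched as below. The function names and the two-state, two-alternative data are hypothetical and serve only to illustrate the computation; `v` is assumed to come from a previous VD step.

```python
from fractions import Fraction as F


def compute_test_quantities(q, p, v):
    """t[i][k] = q[i][k] + sum_j p[i][k][j] * v[j] for every state i and alternative k."""
    return [[q[i][k] + sum(pij * vj for pij, vj in zip(p[i][k], v))
             for k in range(len(q[i]))]
            for i in range(len(q))]


def improve_unconstrained(q, p, v):
    """Pick, independently in each state, the alternative with the largest test quantity."""
    t = compute_test_quantities(q, p, v)
    return [max(range(len(row)), key=row.__getitem__) for row in t]


# Hypothetical data: q[i][k] is the expected reward and p[i][k] the transition row.
q = [[F(6), F(4)], [F(-3), F(-5)]]
p = [[[F(1, 2), F(1, 2)], [F(4, 5), F(1, 5)]],
     [[F(2, 5), F(3, 5)], [F(7, 10), F(3, 10)]]]
v = [F(10), F(0)]          # relative values of the current policy (v_N = 0)

t = compute_test_quantities(q, p, v)
policy = improve_unconstrained(q, p, v)   # 0-based alternative index per state
```

Because the states are uncoupled here, the weights $\pi_i$ never enter the selection. With policy constraints they would have to multiply each row of `t` before any joint maximization.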

Assume that we have a policy P for which the VD has been
performed (i.e., the v's and $\pi$'s computed). Consider any other
policy P'. Each policy has a vector of test quantities T associated
with it, where

$$t_i(P) = q_i(P) + \sum_{j=1}^{N} p_{ij}(P)\, v_j(P), \qquad i = 1,2,\ldots,N \qquad (2.59)$$

$$t_i(P') = q_i(P') + \sum_{j=1}^{N} p_{ij}(P')\, v_j(P), \qquad i = 1,2,\ldots,N \qquad (2.60)$$

where the $v_j$'s are those obtained from the VD for policy P.
Specifically,

$$q_i(P) + \sum_{j=1}^{N} p_{ij}(P)\, v_j(P) = v_i(P) + g(P), \qquad i = 1,2,\ldots,N \qquad (2.61)$$

Thus, (2.59) reduces to

$$t_i(P) = v_i(P) + g(P) \qquad (2.62)$$

Had we solved the VD for policy P', we would have had

$$q_i(P') + \sum_{j=1}^{N} p_{ij}(P')\, v_j(P') = v_i(P') + g(P'), \qquad i = 1,2,\ldots,N \qquad (2.63)$$
Combining (2.63) and (2.60), we can immediately write

$$t_i(P') = v_i(P') + g(P') + \sum_{j=1}^{N} p_{ij}(P') \left[ v_j(P) - v_j(P') \right],
  \qquad i = 1,2,\ldots,N \qquad (2.64)$$

Now compute the components $\gamma_i$ of the difference vector $\Gamma = T(P') - T(P)$
from (2.64) and (2.62):

$$\gamma_i = t_i(P') - t_i(P)
  = v_i(P') + g(P') + \sum_{j=1}^{N} p_{ij}(P') \left[ v_j(P) - v_j(P') \right] - v_i(P) - g(P)$$

Setting

$$\Delta v_i = v_i(P') - v_i(P)$$

$$\Delta g = g(P') - g(P)$$

and rearranging terms, we get

$$\Delta g + \Delta v_i = \gamma_i + \sum_{j=1}^{N} p_{ij}(P')\, \Delta v_j \qquad (2.65)$$


But (2.65) has exactly the same form as (2.51) (the VD
equations) where the correspondence is

$$g \leftrightarrow \Delta g, \qquad v \leftrightarrow \Delta v, \qquad q \leftrightarrow \gamma$$

We previously showed that the solution for g was $\sum_{i=1}^{N} \pi_i q_i$,
where the $\pi$'s are the limiting state probabilities of the policy under
consideration. Since the transition probabilities appearing in (2.65) are
those of P', the relevant limiting state probabilities are those of P'.
Therefore, we can immediately write the solution of
(2.65) as

$$g(P') - g(P) = \Delta g = \sum_{i=1}^{N} \pi_i(P')\, \gamma_i \qquad (2.66)$$


Since the $\pi$'s are nonnegative, (2.66) states that the
sufficient condition for improving the gain (from P to P') is that $\gamma_i$ be
nonnegative for all states and strictly positive in at least one state.
Note that the definition of the test quantity differences $\gamma_i$ involves
quantities known for policy P. In other words, we do not have to solve
the VD for every alternate policy P'. Without knowing $\pi(P')$, we can
guarantee that P' is better than P if at least one $\gamma_i$ is positive
and none is negative.
We refer to this as an "improvement" in state i. Note, however, that
the derived condition is not necessary. There might very well be another
policy P" which gives positive and negative $\gamma$'s in different states
but has a $\pi$ vector which makes the R.H.S. of (2.66) positive. We
cannot discover this, however, without solving the VD for P". Thus, the
best procedure for guaranteeing a gain improvement is to improve the test
quantity in at least one state. In the absence of policy constraints,
(2.57) does that. In the presence of policy constraints, we have to
maximize (2.58), subject to improving at least one state. As mentioned
earlier, that maximization is a combinatorial problem. Solving it
constitutes a modification of PI to handle constrained policies, giving us the
algorithm we are seeking. This we do in the next section.
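
The identity (2.66) can be checked numerically on a small case. The sketch below is an illustration under stated assumptions: it uses the closed-form VD of a two-state ergodic chain (with $v_2 = 0$) rather than a general solver, and the two policies and their data are hypothetical.

```python
from fractions import Fraction as F


def two_state_vd(p, q):
    """Closed-form VD for a 2-state ergodic chain with v_2 = 0."""
    a, b = p[0][1], p[1][0]            # off-diagonal transition probabilities
    pi = [b / (a + b), a / (a + b)]    # limiting state probabilities
    g = pi[0] * q[0] + pi[1] * q[1]    # gain
    v = [(q[0] - g) / a, F(0)]         # from q_1 + p_11 v_1 = g + v_1
    return v, g, pi


def gamma_differences(p_new, q_new, v_old, g_old):
    """gamma_i = t_i(P') - t_i(P), using only the v's and g of the old policy P."""
    t_new = [q_new[i] + sum(pij * vj for pij, vj in zip(p_new[i], v_old))
             for i in range(2)]
    t_old = [v_old[i] + g_old for i in range(2)]   # (2.62)
    return [tn - to for tn, to in zip(t_new, t_old)]


# Hypothetical policies P (old) and P' (new).
p_old, q_old = [[F(1, 2), F(1, 2)], [F(2, 5), F(3, 5)]], [F(6), F(-3)]
p_new, q_new = [[F(4, 5), F(1, 5)], [F(7, 10), F(3, 10)]], [F(4), F(-5)]

v_old, g_old, _ = two_state_vd(p_old, q_old)
_, g_new, pi_new = two_state_vd(p_new, q_new)
gam = gamma_differences(p_new, q_new, v_old, g_old)
delta_g = sum(w * c for w, c in zip(pi_new, gam))   # right-hand side of (2.66)
```

Here both components of `gam` are positive, so the sufficient condition holds, and `delta_g` coincides with `g_new - g_old`. Note that `gam` was computed without solving the VD for P'; only the check of (2.66) itself needs $\pi(P')$.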


D. Development and Convergence of the Algorithm


As explained earlier, the algorithm consists of the VD intact, plus
a modified PI to handle policy constraints. First, we address the
combinatorial problem:

$$\max_{P \in F} \sum_{i=1}^{N} \sum_{k=1}^{K_i} \pi_i t_i^k d_i^k \qquad (2.58)$$

subject to the policy constraints (2.46), indexed by $l = 1,2,\ldots,m$ for
$(i,k) \in S_1$ and by $p = 1,2,\ldots,q$ for $(i,k) \in S_2$,
where all the inequality type of constraints have been grouped together,
as are those of the equality type.

F is the set of feasible policies defined by (2.46), where a
policy, by definition, means selecting one alternative in each state, i.e.,
satisfying the constraints (2.37).

One of the most efficient techniques for solving combinatorial
problems is the branch and bound method (or multiple choice programming)
[2,3,7]. Here, the space we are optimizing over is divided into subsets
in such a manner that only a portion of the whole space is examined. We
will develop a method based on these concepts. Central to this method
are the concepts of branching and fathoming, which we proceed to define.
Branching is the process of obtaining one or more points from a given
infeasible point in the policy space. Fathoming is a property of a point
being considered. If no branching can be made from a point, or if no
benefit is going to result from such branching, the point is said to be
fathomed. Thus, fathoming is basically the termination of branching.
An unfathomed point is eligible for branching.

We will illustrate the method of branching and bounding by
considering an example. Assume that we have three states, with three
alternatives to choose between in each state. Assume, furthermore, that the VD
has been carried out for a given feasible policy, resulting in the test
quantities $t_i^k$, defined by (2.59). We will consider the $t_i^k$ multiplied
by the corresponding $\pi_i$'s and list them as in Table 2.2, where they are
in descending order in each state (e.g., $t_2^1 > t_2^3 > t_2^2$, i.e.,
alternative 1 in state 2 maximizes the test quantity).


Table  2.2 

TABLE  OF  ORDERED  TEST  QUANTITIES 


Table 2.2 also illustrates the policy constraints. Here, we assume
constraints of the simple mutually exclusive type. Corresponding to the
"couplings" in Table 2.2, we have the constraints:

$$d_1^2 + d_2^1 \le 1 \qquad (2.67)$$

dl  + d2  < 1 (2.68) 
$$d_2^3 + d_3^2 \le 1 \qquad (2.69)$$

$$d_2^3 + d_3^1 \le 1 \qquad (2.70)$$


Now we start "branching and bounding" to maximize (2.58), subject
to (2.67) through (2.70) (the equivalent of (2.46)). Since a point
"branches" into other points, we will have a "tree." We will also have
a lower bound for the optimum. (The lower bound is initialized to the
artificial value of $-\infty$.) Whenever branching gives us a feasible point,
the value of L (the function we are trying to maximize) is compared to
the current lower bound. If it exceeds that bound, the bound is updated,
and any point yielding a value of L which is lower than the new bound
is fathomed. Feasible points are fathomed by definition. The search
terminates when no more unfathomed points exist. We start out with the
point representing that T vector whose components are given by (2.57),
i.e., the largest test quantity in each state. Denote it by $T_{01}$. Our
tree then initially consists of one node


where $T_{01} = (t_1^2, t_2^1, t_3^2)$.

The corresponding policy is $P_{01} = (2,1,2)$, i.e., selecting
alternatives 2, 1, and 2 in states 1, 2, and 3, respectively. This is the
policy the PI would select. However, in our example, it is infeasible
(it violates (2.67)). Hence, our lower bound remains at its initial
value of $-\infty$, and we have one unfathomed point $T_{01}$ to branch from.
Branching consists of selecting each alternative in state 1, in turn,
i.e., changing the first component of $T_{01}$. The remainder of the
components are chosen such that they are the largest test quantities in their
states, consistent with the constraint imposed by the first component,
if any. Thus, selecting $t_1^2$, e.g., would prohibit selecting $t_2^1$, and
hence we have to select $t_2^3$ instead. For the third component, we can
select $t_3^2$ because it is not "coupled" with $t_1^2$. The fact that $t_2^3$ and
$t_3^2$ are coupled is postponed to the next level of branching, if we get
there. Hence, our tree becomes


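where

$$T_{11} = (t_1^2, t_2^3, t_3^2), \qquad T_{12} = (t_1^3, t_2^1, t_3^2), \qquad T_{13} = (t_1^1, t_2^1, t_3^2)$$

The corresponding policies are:

$$P_{11} = (2,3,2), \qquad P_{12} = (3,1,2), \qquad P_{13} = (1,1,2)$$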
Only $P_{11}$ is infeasible, whence $T_{12}$ and $T_{13}$ are immediately
fathomed. Moreover, the value of the Lagrangian at those two points is
compared to the lower bound ($-\infty$) and to that at $T_{11}$. Assume that
$L(T_{11}) > L(T_{13}) > L(T_{12})$. In this case, the lower bound is updated to

$$\underline{L} = L(T_{13})$$

and $T_{11}$ is the only point available for branching. If we represent
fathoming by "grounding" the point in the tree, the situation becomes:


Moreover, $P_{13}$ is the optimum policy so far. Now we start
branching from $T_{11}$, as we did from $T_{01}$. Here, we select, in state 2, each
alternative in turn. However, the alternative must be uncoupled from
$t_1^2$. In other words, at each level in the tree, we have a fixed
alternative in a number of states. $T_{01}$ represents level 0. No states have
fixed alternatives. The next level of $T_{11}$, $T_{12}$, and $T_{13}$ has the
alternative in state 1 fixed (this is level 1). Only $T_{11}$ goes down
one further level (to level 2) to attempt fixing alternatives in state
2, consistent with the constraints imposed by the alternative fixed in
the previous level. Since $T_{11}$ has $t_1^2$ fixed, and $t_1^2$ is coupled
with $t_2^1$, we cannot fix the latter. Thus, we only have two successor
points to $T_{11}$. The remaining components are selected such that they
are the maximum test quantities in their states, under the restriction
that they satisfy any constraints imposed by fixing the previous levels.
Since we are down to the last level, the points we obtain, if any, have
to be feasible. Thus, fixing $t_2^3$ imposes selecting $t_3^3$, while fixing
$t_2^2$ enables us to select $t_3^2$. Note that, if $t_2^3$ were coupled with $t_3^3$
also, we would only have one successor point to $T_{11}$. If, in addition,
$t_2^2$ were coupled with all alternatives in state 3, no branching would
be possible from $T_{11}$, and it would be fathomed. Those are interesting
cases because such constraints imply that $t_2^3$ and $t_2^2$ are not really
alternatives at all; they can never be selected in a feasible policy.
This can be detected by manipulating the constraints to discover that
they impose $d_2^3 = d_2^2 = 0$. However, we are more inclined to let the
branch and bound discover this (along with nonfeasible problems). Thus,
at this step, our tree would become:


where the policies at $T_{21}$ and $T_{22}$ are

$P_{21} = (2,3,3)$
$P_{22} = (2,2,2)$

The * next to $T_{13}$ implies that this is the best point obtained so far.

Both $P_{21}$ and $P_{22}$ are feasible, whence we compute $L(T_{21})$ and $L(T_{22})$ and compare them to the current bound (the best value of L so far). Also, both points are fathomed. Assume that $L(T_{22}) > L(T_{21}) > \underline{L}$. In this case, we update $\underline{L}$, set $P_{22}$ as our optimum so far, and survey the tree for unfathomed points:


Since no more branching is possible, $T_{22}$ is the optimum. If we reach the end without encountering any feasible points, the problem is infeasible. This is detected by the lower bound still being at its original value of $-\infty$.

Note  that  in  this  example  there  are  27  different  policies.  Of  those, 
only  17  are  feasible,  and  we  only  considered  6.  The  efficiency  of  branch 
and  bound  techniques  (BB)  results  from  the  manner  in  which  branching  and 
fathoming  are  implemented.  The  branching  takes  feasibility  into  account 
in  a piecemeal  fashion,  one  state  at  a time,  while  being  always  biased 
towards  sets  of  points  where  L has  larger  values.  This  gives  us  a 
chance  to  seize  upon  a feasible  policy  of  large  L relatively  quickly. 


Then, the bounding eliminates, via fathoming, whole sets of points from any further consideration.
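The branching and fathoming rules walked through above can be sketched in code. The following Python sketch is only an illustration under stated assumptions: it assumes every policy constraint can be written as a pair of mutually exclusive ("coupled") alternatives, it expands nodes best-first, and the names `branch_and_bound`, `t`, and `forbidden` are hypothetical and do not come from the text. States are indexed from 0 in the code.

```python
import heapq
from itertools import combinations


def branch_and_bound(t, forbidden):
    """Maximize the sum of test quantities over feasible policies.

    t[i] maps each alternative of state i to its test quantity.
    forbidden is a set of frozensets {(i, a), (j, b)} of coupled alternatives.
    Returns (best_value, best_policy), or (-inf, None) if nothing is feasible.
    """
    n = len(t)

    def compatible(i, a, fixed):
        # Alternative a in state i must not be coupled with any fixed component.
        return all(frozenset({(i, a), (j, b)}) not in forbidden
                   for j, b in enumerate(fixed))

    def complete(fixed):
        # Fill the remaining states with the largest test quantity allowed
        # by the fixed components only (the result may still be infeasible).
        policy = list(fixed)
        for i in range(len(fixed), n):
            allowed = [a for a in t[i] if compatible(i, a, fixed)]
            if not allowed:
                return None  # no branching possible: the node is fathomed
            policy.append(max(allowed, key=lambda a: t[i][a]))
        return tuple(policy)

    def feasible(policy):
        return all(frozenset({(i, policy[i]), (j, policy[j])}) not in forbidden
                   for i, j in combinations(range(n), 2))

    def value(policy):
        return sum(t[i][a] for i, a in enumerate(policy))

    best_value, best_policy = float("-inf"), None
    root = complete(())
    heap = [] if root is None else [(-value(root), (), root)]
    while heap:
        neg, fixed, policy = heapq.heappop(heap)
        if -neg <= best_value:
            continue  # fathomed by the lower bound
        if feasible(policy):
            best_value, best_policy = -neg, policy  # fathomed: feasible node
            continue
        i = len(fixed)  # next state whose alternative gets fixed
        for a in t[i]:
            if compatible(i, a, fixed):
                child_fixed = fixed + (a,)
                child = complete(child_fixed)
                if child is not None:
                    heapq.heappush(heap, (-value(child), child_fixed, child))
    return best_value, best_policy


# Illustrative numbers for two coupled states (hypothetical usage).
t = [{1: 4.213, 2: 3.373, 3: 2.207}, {1: 3.333, 2: 4.323}]
forbidden = {
    frozenset({(0, 1), (1, 2)}), frozenset({(0, 2), (1, 1)}),
    frozenset({(0, 3), (1, 1)}), frozenset({(0, 2), (1, 2)}),
}
print(branch_and_bound(t, forbidden))
```

An infeasible problem shows up exactly as described in the text: the lower bound is still at its initial value of minus infinity when the tree is exhausted.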

Now  we  embark  upon  proving  that  the  outlined  BB  method  maximizes 
the  Lagrangian  over  the  set  of  feasible  policies. 

First,  we  prove  that,  as  we  move  into  deeper  levels  in  the  tree, 
the  value  of  L cannot  increase. 

Proposition  2.5. 

If branching occurs from some infeasible node I at level k in the tree, the value of L at the resultant nodes cannot exceed that at I.

Proof.

The value of L at any node corresponding to a policy P is merely the sum of the components of the vector T(P) corresponding to that policy.

Now consider T(I) at the infeasible node I at level k. Its first k components do not violate any constraints. Moreover, starting from component k+1, each component is the maximum in its state, subject to the constraints imposed by the first k components.

Now any node resulting from I has a T which agrees with T(I) in the first k components. Component k+1 is any one in state k+1, subject to the constraints imposed by the first k components. Hence, it cannot exceed component k+1 of T(I). Starting from component k+2, each component of a branch is the maximum in its state, subject to the constraints imposed by the first k+1 components. These constraints might be the same as, or more than, the constraints imposed by the first k components, but not less. Hence, from component k+2, no component of a branch node can exceed the corresponding component of T(I). Thus, we have shown that, starting from component k+1, no component of a branch node can exceed the corresponding one in T(I). This proves the proposition.

Proposition  2.6. 

If a given feasible policy P is not in the BB tree and there exists a fathomed policy P' in the tree at level k such that P' and P agree in the first k components, then the value of L at P cannot exceed the optimum value obtained by the BB.

Proof . 

We  have  two  cases  to  consider: 

(a)  P' is feasible. Since P' agrees with P in the first k components and the rest of the components of T(P') are the maximum in their states, subject to the constraints imposed by the first k components, then no component of T(P) can exceed the corresponding component of T(P').

(b)  P' is infeasible. Assume that P' is fathomed because no branching is possible from it. This means that the first k components impose constraints which make all alternatives in state k+1 infeasible. But P agrees with P' in the first k components, whence they impose the same constraints. Hence, component k+1 of P violates some constraint, i.e., P is infeasible. But this contradicts the assumptions. Hence, P' was fathomed because there exists some node F elsewhere in the tree yielding a larger value for L than node P'. Since P is obtainable from P' by branching, Proposition 2.5 says that L at P cannot exceed that at P', whence it cannot exceed that at F.


Thus,  in  (a)  and  (b) , we  have  shown  that  there  exists  a feasible 
policy  in  the  tree  where  the  value  of  L is  not  less  than  that  at  P. 
But  since  the  optimum  obtained  by  BB  is  the  largest  value  for  L over 
all  feasible  nodes  in  the  whole  tree,  the  proposition  is  proved. 


Proposition  2.7. 

If  a given  feasible  policy  P is  not  in  the  BB  tree  and  there 
exists  an  unfathomed  policy  I in  the  tree  at  level  k,  such  that  P 
and  I agree  in  the  first  k components,  then  there  exists  a policy 
P'  in  the  tree  at  level  k +1  such  that  P and  P'  agree  in  the 
first  k+1  components. 

Proof . 

Since I is not fathomed, branching has occurred from it. Consider the branch nodes. They all agree with I, whence with P, in the first k components. Component k+1 takes on all values in state k+1 such that the constraints imposed by the first k components are not violated. But component k+1 of P satisfies the same constraints (because it satisfies all constraints). Thus, one of the branch nodes from I agrees with P in component k+1, whence it agrees with it in the first k+1 components. Since this node is at level k+1, the proposition is proved.

Proposition 2.8.

If a given feasible policy P is not in the BB tree and there exists a node N in the tree at level k such that N and P agree in the first k components, then the value of L at P cannot exceed the optimum obtained by BB.

Proof . 

Consider  the  node  N.  It  is  at  level  k and  agrees  with  P in  the 
first  k components.  If  N is  fathomed,  apply  Proposition  2.6. 

If N is not fathomed, apply Proposition 2.7 repeatedly. Every time we get to an unfathomed node at level $\ell$, there is a node branching from it at level $\ell + 1$, agreeing with P in $\ell + 1$ components. Finally, we reach a fathomed node at some level $m \le M$ (where M is the deepest level the tree reaches) and apply Proposition 2.6.

Proposition  2.9. 

No  feasible  policy  P can  yield  a value  of  L greater  than  the 
optimum  obtained  by  BB . 

Proof . 

If P is in the tree, the proof is trivial. Consider a feasible policy P not in the tree. The first component in P is an alternative in state 1. Now look at level 1 in the tree. It has as many nodes as state 1 has alternatives. Each node has one alternative in state 1 as its first component. Thus, there exists a node N in the tree at level 1 such that N and P agree in the first component. We have satisfied the assumptions of Proposition 2.8, whence its result applies, completing the proof.


Now that we have a method for solving the combinatorial problem of maximizing the Lagrangian over the set of feasible policies, we have to guarantee an improvement in the gain. This we do by guaranteeing that the quantity defined by (2.66) never be negative. To ensure this, we do not allow negative $\gamma_i$'s in each iteration. To illustrate, we first repeat Table 2.2.


State  Ordered  Test  Quantities 


Now we assume that we entered with policy (3,3,3). We want to disallow any test quantity that is less than that of the current policy in each state (whence no $\gamma_i$ can be negative). In our case, the policies we consider in BB in this iteration would be given by the following table.


State  Ordered  Test  Quantities 


No resulting policy can have any negative $\gamma_i$, whence (2.66) can never be negative, i.e., we prevent selection of a policy with lower gain. Assume that BB yields policy (2,3,3). Assume, furthermore, that VD for that policy results in the following table.

State   Ordered Test Quantities

[Table: ordered test quantities for policy (2,3,3), with coupling lines.]

(Note that the "coupling" is between alternatives in different states, not between test quantities. That is why the "couplings" look different in this table.) Now we enter BB with policy (2,3,3), and, to restrict the $\gamma_i$, we only consider the following.


State   Ordered Test Quantities

[Table: test quantities retained for BB when entering with policy (2,3,3).]


It is obvious that discarding alternatives reduces the amount of computation involved in each iteration. To increase the efficiency, we can renumber the states in each iteration such that state 1 has the least number of alternatives. The obvious question is: what if the BB results in the same policy we entered with? What is the characteristic of such a policy? In the following theorem, we show that it maximizes the gain over a subset of the feasible policies.


Theorem  2.3. 

If the maximization of the Lagrangian over the set of feasible policies, subject to $\gamma_i \ge 0$ for all i, yields a policy P for which $\gamma_i = 0$ for all i, then that policy maximizes the gain over the set of all feasible policies that differ with it in exactly one state.

Proof.

Let P' be a feasible policy differing with P in the k-th component (i.e., state). Let $P(k) = \ell$ and $P'(k) = m$. Assume that $t_k^m > t_k^\ell$. Since T(P) differs from T(P') only in the k-th component, the Lagrangian at P' is greater than at P. This implies that there exists a feasible policy P' yielding a value of L greater than the optimum obtained by BB. But this contradicts Proposition 2.9. Hence,

$$t_k^m \le t_k^\ell$$

Thus, with $\gamma = T(P') - T(P)$,

$$\gamma_k \le 0 \quad \text{and} \quad \gamma_i = 0 \ \text{for} \ i \ne k$$

Hence,

$$g(P') - g(P) = \sum_{i=1}^{N} \pi_i(P') \, \gamma_i = \pi_k(P') \, \gamma_k \le 0$$

$$g(P') \le g(P)$$

Thus, any feasible policy differing with P in exactly one state cannot have a higher gain than P.


Thus, when BB converges to a policy, that does not necessarily mean that it is the overall optimum. What we propose to do in this case is to make that policy, and all policies differing with it in exactly one state, infeasible, thus removing a whole subset from further consideration. This can be achieved by adding just one constraint to the set of policy constraints. If the policy P is given by $P(i) = k_i$, the constraint is

$$\sum_{i=1}^{N} d_i^{k_i} \le N - 2 \qquad (2.71)$$

(As a matter of fact, (2.71) is a special case of a general form. For example, to make an individual policy infeasible, the R.H.S. would be N - 1. In general, to make a policy, and all those differing with it in exactly M < N states, infeasible, the R.H.S. of (2.71) would be N - M - 1.)
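The effect of a (2.71) type constraint can be checked with a few lines of code. This is a minimal sketch, assuming the left-hand side counts the states in which a candidate policy selects the same alternative as the reference policy; the function name `excluded` and its parameters are hypothetical.

```python
def excluded(policy, reference, m=1):
    """Return True if `policy` violates the exclusion constraint built on `reference`.

    The left-hand side counts the states in which `policy` agrees with
    `reference`; the right-hand side is N - m - 1, as in the general form
    quoted in the text (m = 1 gives the N - 2 case, m = 0 the N - 1 case).
    """
    n = len(reference)
    agreements = sum(1 for a, b in zip(policy, reference) if a == b)
    return agreements > n - m - 1


# Hypothetical usage with two coupled states and reference policy (1, 1).
for candidate in [(1, 1), (1, 2), (3, 1), (3, 2)]:
    print(candidate, excluded(candidate, (1, 1)))
```

With m = 1 the reference policy and every policy agreeing with it in all but one state are removed; with m = 0 only the reference policy itself is removed.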

In order to increase computational efficiency, we divide the states into two types. The "free states" are those which are not involved in any policy constraints, i.e., their $d_i^k$'s are only involved in relations of the type (2.37). The states involved in relations of the type (2.46), i.e., having alternatives that are "coupled" with each other, we refer to as "coupled states." Given a feasible policy A for which VD has been performed, we first attempt a regular PI (maximizing test quantities over all states). If this does not change A, we have an overall optimum policy. If the resultant policy is feasible, we start a new VD. Otherwise, we maximize over the free states by regular PI, and over coupled states by BB with $\gamma_i \ge 0$. This results in a policy B. If B differs from A, we enter VD. Otherwise, we know that A maximizes the gain over the set of feasible policies differing with A in exactly one coupled state. In this case, we make policy A, as well as all policies differing with it in exactly one coupled state, infeasible by a (2.71) type constraint (here, the N in (2.71) would be the number of coupled states). Then, we enter BB again to maximize the Lagrangian over the currently feasible policy set, with the $\gamma_i \ge 0$ restriction removed. Basically, we are looking for a feasible policy, irrespective of gain, whence it makes sense to use the latest values of $t_i^k$, since they contain a certain amount of the algorithm's history up to this point.

Every time BB converges to a subset maximizer, we compare its gain to the best previous subset maximizer and retain the one with the higher gain. As a result of removing subsets over which we maximize, the feasible policy set quickly shrinks until it becomes empty. When BB results in an infeasible problem, we have an optimum policy.

To get an initial feasible policy, we maximize the sum of the immediate expected rewards $q_i$ over the feasible policy set using BB. Schematically, then, our algorithm can be represented in Figure 2.3. The convergence of this algorithm to an optimum feasible policy is readily proved.

Consider a feasible problem (nonfeasible ones are detected at the outset by exiting from E1). Now consider any feasible policy P other than the one selected by the algorithm.

If we exited the algorithm from E2, then policy A has the highest gain of any policy (whether feasible or not). This is because, for A, each test quantity is the maximum in its state. Any policy which is not identical with A has to be different from it in at least one state. Consequently, any policy other than A results in nonpositive $\gamma_i$'s, with at least one $\gamma_i$ strictly negative. Equation (2.66) then implies g(P) < g(A). This is the case where the policy constraints do not affect the feasibility of that policy yielding the highest gain in the absence of any such constraints. Thus, we have retained the ability to detect such a policy, without having to exhaust the feasible policy set, by (2.71) type constraints.

Fig. 2.3. ALGORITHM FOR RISK-INDIFFERENT CASE.

If we exited the algorithm from E3, then, at that point, the given feasible policy P had become infeasible. The only way this can come about is from (2.71) type constraints. Hence, P belongs to some subset over which we have maximized the gain, whence g(P) cannot exceed the gain of the policy selected by the algorithm.

Since no feasible policy can have a gain higher than that of the policy selected by the algorithm, the latter is the optimum feasible policy.


E. The Example

We will take Howard's famous taxicab example [4] and add some policy constraints to it. The taxi-cab driver works in an area encompassing three towns A, B, and C. In towns A and C, he has three alternatives.


1.  He  can  cruise  in  the  hope  of  being  hailed  by  a passenger. 

2.  He  can  drive  to  the  nearest  cab  stand  and  wait  in  line. 

3.  He  can  pull  over  and  wait  for  a radio  call. 


If he is in town B, alternative 3 is not available because there is no radio cab service in that town. For a given town and alternative, there is a probability that the next trip will be to each of the towns A, B, and C, and a corresponding net monetary reward associated with each such trip. If towns A, B, and C are identified with states 1, 2, and 3, then Table 2.3 gives the probabilities and rewards.

Table 2.3

PROBABILITIES AND REWARDS FOR THE TAXICAB EXAMPLE

Now we introduce the constraints. In towns A and B, there is a union for taxi-cab drivers. The union owns the cab stands in both towns and the radio cab service in town A. Nonmembers are denied the use of union facilities. Union membership, however, has strings attached to it. To become a member, a driver has to use the facilities, except that he can only use one cab stand (either in town A or town B, to give other members a chance). Thus, if our friend joins the union, alternative 1 (cruising) is not available for him in states 1 and 2, whereas if he does not, alternative 1 becomes the only available one in both states. In other words, either $d_1^1 = d_2^1 = 0$ or $d_1^1 = d_2^1 = 1$. This is a type (2.43) constraint. Specifically,

$$d_1^1 = d_2^1 \qquad (2.72)$$

Also, since he cannot select alternative 2 in both states 1 and 2, no matter what, we have a (2.41) type constraint. Specifically,

$$d_1^2 + d_2^2 \le 1 \qquad (2.73)$$

Our friend wants to select a policy that yields the highest gain, subject to those constraints.

The first step in the algorithm is to set the optimum gain $g^* = -\infty$ and get an initial feasible policy. We use BB to maximize the sum of immediate expected rewards over the feasible set. Our first node in the tree is the one that picks the maximum $q_i^k$ in each state:


$T_{01} = (8, 16, 7)$   $P_{01} = (1,1,1)$

$P_{01}$ is feasible, whence $T_{01}$ is fathomed, and we have an initial feasible policy A = (1,1,1).


Performing  the  VD  for  A gives: 


State   Ordered Test Quantities

1       $t_1^1 = 4.213$   $t_1^2 = 3.373$   $t_1^3 = 2.207$
2       $t_2^2 = 4.323$   $t_2^1 = 3.333$
3       $t_3^2 = 3.907$   $t_3^1 = 3.680$   $t_3^3 = 2.387$

and g(A) = 9.200


Constraint (2.72) is represented by a double line coupling the two alternatives involved, while (2.73) is represented by a single line. Since the alternatives in state 3 are not coupled to any other alternatives, state 3 comprises the set of "free states" we defined earlier. The dotted line separates the free states from the coupled ones.


The regular PI gives us policy (1,2,2) which is infeasible; it violates (2.72). Hence, we maximize over the free state to get P(3) = 2. For the coupled states, we enter BB with $\gamma_i \ge 0$, i.e., the following table:

State   Ordered Test Quantities

1       $t_1^1$
2       $t_2^2$   $t_2^1$

The first node is the maximum in each state, an infeasible one:

$T_{01} = (4.213, 4.323)$   $P_{01} = (1,2)$

To branch to the next level, we fix alternatives in state 1. Since we only have one alternative, we only have one branch. Also, $d_1^1 = 1$ and (2.72) make $d_2^1 = 1$, whence we can only select alternative 1 in state 2. This makes the branch node feasible, whence it is immediately fathomed.

$T_{11} = (4.213, 3.333)$   $P_{11} = (1,1)$

Therefore, P(1) = 1 and P(2) = 1. Thus, we emerge with a policy B = (1,1,2), different from A. So, we call this new policy A, i.e., A = (1,1,2), and perform the VD for A. This results in:


State   Ordered Test Quantities

3       $t_3^2 = 2.741$   $t_3^1 = 2.620$   $t_3^3 = 1.617$

A = (1,1,2)   g(A) = 9.366


Maximizing in all states, i.e., regular PI, results in (1,2,2) which is infeasible. It violates (2.72). Thus, we maximize in state 3 to get P(3) = 2, and we enter BB, for the coupled states only, discarding any alternatives having $t_i^k$ less than the one we entered with (i.e., $\gamma_i \ge 0$). Thus, what we consider is given by:


State  Ordered  Test  Quantities 


The  third  component  of  our  policy  will  always  be  2,  so  we  only  write  the 
first  2 for  brevity. 

Our first node, as usual, is the maximum in each state. This, we already know, is infeasible, whence we will branch from $T_{01}$.

$T_{01} = (3.960, 6.720)$   $P_{01} = (1,2)$


Our first level is obtained by fixing the first component to each available alternative in state 1 and selecting the maximum in state 2, consistent with the constraints imposed by the first component. Here, we only have one alternative available in state 1, namely, alternative 1. Hence, we set $d_1^1 = 1$. This immediately makes $d_2^1 = 1$ by virtue of (2.72). Hence,

$T_{11} = (3.960, 3.333)$   $P_{11} = (1,1)$

$P_{11}$ is feasible, whence $T_{11}$ is immediately fathomed and also gives the largest Lagrangian over the defined set. Hence, BB yields a policy B = (1,1,2). But this is the same as policy A that we entered with.

Hence, A maximizes the gain over the set of all feasible policies that differ with it in exactly one component in the coupled states. We therefore remove A and that set from further consideration. This is achieved by adding the constraint:

$$d_1^1 + d_2^1 \le M_c - 2$$

where $M_c$ is the number of coupled states. Here, $M_c = 2$. Thus,

$$d_1^1 + d_2^1 \le 0 \qquad (2.74)$$

Note that (2.74) makes $d_1^1 = d_2^1 = 0$, i.e., those alternatives are removed from further consideration. We let the algorithm deal with that, however, rather than scanning every constraint.
Now we compare g(A) = 9.366 to $g^* = -\infty$. g(A) is greater, so we set our optimum policy $P_o$, so far, as

$P_o = A = (1,1,2)$   $g^* = 9.366$

and look for a feasible solution by maximizing L over the coupled states, with $\gamma_i$ unrestricted in sign. Thus, the values considered in the BB are given by


State   Ordered Test Quantities

[Table: test quantities of the coupled states, with $\gamma_i$ unrestricted in sign.]

The BB tree is given by:

$T_{01} = (3.960, 6.720)$   $P_{01} = (1,2)$
$T_{11} = (2.077, 6.720)$   $P_{11} = (3,2)$

where $P_{11}$ is feasible.

Hence, we have a new policy A = (3,2,2) for which we perform VD to obtain

State   Ordered Test Quantities

1       $t_1^2 = 1.041$   $t_1^1 = 0.579$   $t_1^3 = 0.311$
2       $t_2^2 = 20.69$   $t_2^1 = 9.086$
3       $t_3^2 = 1.511$   $t_3^1 = 0.948$   $t_3^3 = -0.182$

A = (3,2,2)   g(A) = 12.774

Regular PI gives (2,2,2), an infeasible policy. Hence, we maximize in state 3 to get P(3) = 2, and we enter BB for states 1 and 2 with the following table.


State  Ordered  Test  Quantities 


AD-A034 249  STANFORD UNIV CALIF DEPT OF ENGINEERING-ECONOMIC SYSTEMS  F/9 12/1
MARKOV DECISION PROCESSES WITH POLICY CONSTRAINTS. (U)
APR 76  J NAFEH  NSF-9K-36491
UNCLASSIFIED  EES-OA-76-3  ML


Thus, BB gives us (3,2,2), the policy we entered with. g(A) = 12.774 and $g^* = 9.366$. Consequently, we update our optimum policy and gain

$P_o = (3,2,2)$   $g^* = 12.774$

and add the constraint

$$d_1^3 + d_2^2 \le 0 \qquad (2.75)$$


Then  we  try  to  get  another  feasible  policy.  The  following  table 
gives  the  values  considered  by  BB,  followed  by  the  BB  tree. 


State  Ordered  Test  Quantities 


(Of  course,  we  have  the  additional  two  constraints  (2.74)  and  (2.75).) 


$T_{01} = (1.041, 20.69)$   $P_{01} = (2,2)$

$P_{01}$ is infeasible, and no branching is possible from level 0. This is an indication that the problem is infeasible. (This is the effect of (2.74) and (2.75). They force $d_2^1 = 0$ and $d_2^2 = 0$, i.e., no alternatives can be selected in state 2, whence infeasibility.) Thus, we have exhausted the set of feasible policies (without evaluating each policy, merely by removing subsets), and we have an optimum policy:

$P_o = (3,2,2)$   $g^* = 12.774$


The unconstrained problem had its optimum at (2,2,2). That policy, however, violates (2.73), whence it is infeasible. The introduction of constraints thus affects the policy (in some cases, this might not be true, as we shall later show when discussing sensitivity to constraints), and the feasible policy yielding the highest gain is the one we obtained. (Incidentally, it turns out to be more beneficial for our friend to become a union member.)
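Because the example is small, the result can be cross-checked by enumerating every policy. The sketch below is a brute-force check, not the algorithm of this chapter. It assumes that Table 2.3 holds the standard data of Howard's taxicab problem, it approximates the limiting state probabilities by repeated multiplication with the transition matrix (adequate here because every policy of the example is completely ergodic), and the names `gain` and `feasible` are hypothetical.

```python
from itertools import product

# Assumed contents of Table 2.3 (standard data of Howard's taxicab problem):
# (state, alternative) -> (transition probabilities to states 1..3, expected reward q)
TAXI = {
    (1, 1): ((1 / 2, 1 / 4, 1 / 4), 8.0),
    (1, 2): ((1 / 16, 3 / 4, 3 / 16), 2.75),
    (1, 3): ((1 / 4, 1 / 8, 5 / 8), 4.25),
    (2, 1): ((1 / 2, 0.0, 1 / 2), 16.0),
    (2, 2): ((1 / 16, 7 / 8, 1 / 16), 15.0),
    (3, 1): ((1 / 4, 1 / 4, 1 / 2), 7.0),
    (3, 2): ((1 / 8, 3 / 4, 1 / 8), 4.0),
    (3, 3): ((3 / 4, 1 / 16, 3 / 16), 4.5),
}


def gain(policy, sweeps=500):
    """Approximate the gain sum_i pi_i q_i of a policy."""
    rows = [TAXI[(i + 1, k)] for i, k in enumerate(policy)]
    n = len(rows)
    pi = [1.0 / n] * n
    for _ in range(sweeps):
        # One step of pi <- pi P; converges to the limiting state probabilities.
        pi = [sum(pi[i] * rows[i][0][j] for i in range(n)) for j in range(n)]
    return sum(p * q for p, (_, q) in zip(pi, rows))


def feasible(policy):
    """Constraints (2.72) and (2.73) on the alternatives chosen in states 1 and 2."""
    a1, a2 = policy[0], policy[1]
    same_cruising = (a1 == 1) == (a2 == 1)   # (2.72): cruise in both states or in neither
    one_stand = not (a1 == 2 and a2 == 2)    # (2.73): at most one cab stand
    return same_cruising and one_stand


all_policies = list(product((1, 2, 3), (1, 2), (1, 2, 3)))
feasible_policies = [p for p in all_policies if feasible(p)]
best = max(feasible_policies, key=gain)
print(len(all_policies), len(feasible_policies), best, round(gain(best), 3))
```

Enumeration is practical only because there are 18 policies here; the point of the subset-removal scheme in the text is to avoid evaluating each policy when the policy set is large.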


Transient  States  and  Periodic  Processes 


In the foregoing, all states were assumed to be recurrent, i.e., $\pi_i > 0$ for all states. If we have transient states and they happen to be coupled, the theorem of Section C would not apply. This is because the test quantities are multiplied by $\pi_i$, whereas the $\gamma_i$ are defined on the test quantities. As long as $\pi_i > 0$, then inequalities of test quantities are not affected by multiplication by $\pi_i$. If $\pi_i = 0$, however, then multiplication by $\pi_i$ equates all test quantities in state i to zero, and that does not necessarily imply $\gamma_i$'s of zero. Since the proof of the theorem was based on that fact, it does not hold for transient states. In other words, it is not true that BB converges to a policy that maximizes the gain over the subset of feasible policies differing with it in exactly one state if one of the coupled states is transient.

Hence, we need to deal with coupled transient states separately. We do this in the following manner. We fix the alternatives in the recurrent coupled states to those of the current policy (i.e., set the corresponding $d_i^k = 1$). Then we use BB with $\gamma_i \ge 0$ to maximize the sum of the test quantities over the transient coupled states. This is basically a feasibility exploration. If this process results in a change in transient coupled state alternatives (in at least one state), we have an improved policy. If not, we fix the alternatives in the transient coupled states to those of the current policy and use BB with $\gamma_i \ge 0$ to maximize the Lagrangian over the recurrent coupled states. If the policy does not change, then it maximizes the gain over the set of policies that differ with it in exactly one coupled state.

Finally, a word about periodic processes. By a periodic process, we mean one whose transition probability matrix is periodic. In this case, $\pi_i = 1/N$ for all states. The case of interest is one in which all feasible policies are periodic. We know that

$$g = \sum_{i=1}^{N} \sum_{k=1}^{K_i} \pi_i \, d_i^k \, q_i^k$$

and it is g we want to maximize. If all policies are periodic, then we want to maximize

$$\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K_i} d_i^k \, q_i^k$$


In this case, we only need one iteration. Our initial feasible policy is the optimum one since we use BB to maximize the sum of $q_i$ over the feasible policy set to get it. Of course, this is an inherently deterministic problem (we just maximize the average immediate reward). Unless we know a priori that all feasible policies are periodic, we do not seek to find that out (since it is computationally equivalent to explicitly enumerating the feasible policy set). The only reason for discussing periodic policies is to shed some light on initial feasible policies and transient coupled states. Maximizing a sum is equivalent to maximizing the average if a uniform probability distribution is assumed. The latter is characteristic of the $\pi_i$ of periodic policies. The uniform distribution, however, is the mathematical encoding of a Bayesian's profession of complete ignorance of a process. In the case of periodic policies, we do not know where we will find the process if we enter it at a random point in time. This is essentially what we are saying at the start of the algorithm when we do not have any feasible policy available (we do not have a transition probability matrix). In the case of transient coupled states, our Lagrangian has degenerated, making all test quantities equivalent and thus obliterating our accumulated knowledge to this point. Consequently, we profess complete ignorance about those states and maximize the sum (i.e., average) of the original test quantities.


Chapter  III 


RISK-SENSITIVE  MARKOV  DECISION  PROCESSES 
A. Introduction

In  this  chapter,  we  develop  an  algorithm  for  risk-sensitive  Markov 
Decision  Processes  with  policy  constraints. 

As in Chapter II, we start by reviewing previous work in Section B. Instead of expected values, we deal with utility functions and certain equivalents, the standard method of incorporating a decision maker's attitude towards risk (i.e., uncertain propositions). The policy evaluation-policy improvement (PE-PI) algorithm developed by Howard and Matheson [6] is outlined. It is the counterpart of the VD-PI algorithm for the risk-indifferent case. Just as our algorithm of Chapter II was based on the VD-PI, our algorithm here is based on the PE-PI.

In  Section  C,  we  use  the  fact  that  the  original  problem  is  actually 
a constrained  optimization  problem  to  formulate  it  in  the  Lagrangian 
framework.  We  show  that  decomposing  the  optimization  of  the  Lagrangian 
into  two  problems  results  in  the  PE-PI  algorithm  when  no  policy  con- 
straints are  present.  The  dependence  of  the  algorithm  on  the  sign  of 
the  risk  aversion  coefficient  7 is  brought  out.  In  the  case  of  a risk 
preferring  decision  maker  (7  < 0) , the  problem  is  one  of  maximization. 
For  7 > 0,  it  is  a minimization  problem.  In  PE-PI,  making  the  utili- 
ties of  opposite  sign  to  7 during  the  PE  phase,  transforms  the  problem 
into  one  of  maximization  for  both  cases.  Hence,  the  sign  of  the  utili- 
ties is  not  really  arbitrary.  It  has  to  be  opposite  to  that  of  7; 
otherwise  minimization  would  have  to  replace  maximization  in  the  PI 
phase . 
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Introduction  of  policy  constraints  makes  the  PI  phase  inapplicable 


just  as  was  the  case  for  risk-indifferent  analysis.  Therefore,  we  have 
to  look  elsewhere  for  solving  the  essentially  combinatorial  problem  of 
policy  improvement. 

In Section C, we also show that the Lagrange multipliers are the
utilities of the PE-PI. They represent the cost of violating the
constraints; in this case, the constraints defining an eigenvalue problem.
The realization that the constraints in the risk-indifferent case (the
equations defining the limiting state probabilities) also define an
eigenvalue problem leads us to discover the counterpart of the limiting
state probabilities. Whereas, in the risk-indifferent case, the
eigenvalue problem pertained to the transition probability matrix of a
policy, in the risk-sensitive case it pertains to the matrix of
"disutility contributions" of the policy, a combined measure of the
probabilities of transitions and how much they contribute to "disutility."
In this case, we have an eigenvector $Z$ defining the equilibrium flow of
those quantities, just as the vector $\Pi$ of limiting state probabilities
defined the equilibrium of probabilistic flows in the risk-indifferent
case. It is the components of this vector $Z$ which should weight the
test quantities in each state when policy constraints are present.
Whereas weighting by $\Pi$ was intuitively obvious in the risk-indifferent
case ($\pi_i$ being the fraction of time the process spends in state $i$,
on the average, in the long run), the weighting in the risk-sensitive case
almost defies intuition. At the very least, it is not transparent. It is
the Lagrange multiplier formulation that brings it out.

In Section D, we set out to develop an algorithm similar to the
one we developed in Chapter II. We outline a sufficient condition for
guaranteeing policy improvement (since, as before, maximizing the
Lagrangian does not, per se, guarantee that). As before, it turns out that,
if in every state we disregard alternatives whose test quantities are
less than that of the current policy, the policy is guaranteed to improve
if PI changes it. We still retain the division into "free" and "coupled"
states, improving the free states by regular PI and the coupled states by
BB, imposing the sufficient condition for improvement. Convergence of BB
to a given policy still means that that policy is optimum over the subset
of feasible policies that differ from it in exactly one coupled state.
In such cases, we make those policies infeasible and continue.

If the policy constraints do not make the best possible policy
infeasible, then retaining the feature of maximizing test quantities in
each state still enables us to detect that policy without having to
exhaust the feasible policy set. The convergence of the algorithm is also
proved.

In Section E, we apply the algorithm to the problem of Chapter II
when the taxicab driver is not risk-indifferent.

B. Markov Decision Processes without Policy Constraints

In Chapter II, the decision maker was assumed to be risk-indifferent,
whence the basic premise was to maximize the expected value of outcomes.
For a risk-sensitive decision maker, we have to maximize the expected
value of the "utilities" of outcomes, where those "utilities" are
defined by the decision maker's "utility function." The latter encodes
his attitude towards risk if he subscribes to certain arguments regarding
risky propositions or "lotteries." An outcome having value $v$ is
assigned a utility $u(v)$, and the expected value of the utilities is
called the utility of the lottery. An important concept in risk-sensitive
analysis is that of certain equivalent (CE). The CE of a lottery
is the value $\tilde{v}$ whose utility is the same as the utility of the
lottery (the bar denoting the expected value over the lottery):

$$ u(\tilde{v}) = \overline{u(v)} \tag{3.1} $$

Thus, if we have a lottery whose utility is $x$, its certain equivalent
$\tilde{v}$ is given by

$$ \tilde{v} = u^{-1}(x) \tag{3.2} $$

where $u^{-1}$ is the inverse of the utility function $u$.

We will restrict ourselves to dealing with a decision maker who
subscribes to what is known as the delta property: if all prizes in a
lottery are increased by the same amount $\Delta$, his certain equivalent
for the lottery increases by $\Delta$. Such a decision maker possesses a
utility function which is either linear or exponential. The linear case
implies risk indifference, so we will work with exponential utility
functions

$$ u(v) = -(\operatorname{sgn}\gamma)\, e^{-\gamma v} \tag{3.3} $$

$$ u^{-1}(x) = -\frac{1}{\gamma} \ln\left[ -(\operatorname{sgn}\gamma)\, x \right] \tag{3.4} $$

where $\gamma$ is the risk aversion coefficient. Risk averters have a
positive $\gamma$, while risk preferrers have a negative $\gamma$.
$(\operatorname{sgn}\gamma)$ denotes the sign of $\gamma$. An important
implication of the exponential utility function is the following:

$$ u(v + \Delta) = -(\operatorname{sgn}\gamma)\, e^{-\gamma (v + \Delta)} = -(\operatorname{sgn}\gamma)\, e^{-\gamma v} e^{-\gamma \Delta} $$

i.e.,

$$ u(v + \Delta) = e^{-\gamma \Delta}\, u(v) \tag{3.5} $$

Adding a constant $\Delta$ to all lottery prizes causes their utilities to
be multiplied by $e^{-\gamma \Delta}$.
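The delta property can be checked numerically. The following is a minimal sketch, assuming a hypothetical two-prize lottery; the function names and the numbers are illustrative and do not come from the text.

```python
import math

def sgn(gamma):
    """Sign of the risk aversion coefficient (gamma must be nonzero)."""
    return 1.0 if gamma > 0 else -1.0

def utility(v, gamma):
    """Exponential utility (3.3): u(v) = -(sgn gamma) * exp(-gamma * v)."""
    return -sgn(gamma) * math.exp(-gamma * v)

def inverse_utility(x, gamma):
    """Inverse utility (3.4): u^{-1}(x) = -(1/gamma) * ln(-(sgn gamma) * x)."""
    return -math.log(-sgn(gamma) * x) / gamma

def certain_equivalent(prizes, probs, gamma):
    """Value whose utility equals the expected utility of the lottery, as in (3.1)-(3.2)."""
    expected_u = sum(p * utility(v, gamma) for v, p in zip(prizes, probs))
    return inverse_utility(expected_u, gamma)

# Hypothetical lottery: win 0 or 100 with equal probability.
prizes, probs, gamma, delta = [0.0, 100.0], [0.5, 0.5], 0.01, 25.0
ce = certain_equivalent(prizes, probs, gamma)
ce_shifted = certain_equivalent([v + delta for v in prizes], probs, gamma)
print(ce, ce_shifted - ce)  # the difference should equal delta up to rounding
```

With a positive $\gamma$ the certain equivalent falls below the expected prize, and with a negative $\gamma$ it rises above it, which is the sense in which $\gamma$ encodes the attitude towards risk.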

Now we are in a position to analyze the Markov Decision Process for
a decision maker possessing the delta property, i.e., an exponential
utility function given by (3.3) and (3.4). As usual, we first consider the
limited time horizon and then let $n$ tend to $\infty$. Given a certain
policy (i.e., a probability transition matrix and associated reward
matrix), the process will generate a total reward $v_i(n+1)$ if it is in
state $i$ and is allowed to continue for $n+1$ transitions. This uncertain
reward has a certain equivalent $\tilde{v}_i(n+1)$. The CE is that amount
the decision maker would be willing to take for certain instead of
receiving the uncertain reward generated by the Markov process. It can be
shown [6] that this CE is given by

$$ u\left( \tilde{v}_i(n+1) \right) = \sum_{j=1}^{N} p_{ij}\, u\left[ r_{ij} + \tilde{v}_j(n) \right] \tag{3.6} $$

Using the property given in (3.5), we can reduce (3.6) to

$$ u\left( \tilde{v}_i(n+1) \right) = \sum_{j=1}^{N} p_{ij}\, e^{-\gamma r_{ij}}\, u\left( \tilde{v}_j(n) \right) \tag{3.7} $$

If we define the utility of being in state $j$, with $n$ transitions
remaining, as $u_j(n)$,

$$ u_j(n) = u\left( \tilde{v}_j(n) \right) = -(\operatorname{sgn}\gamma)\, e^{-\gamma \tilde{v}_j(n)} \tag{3.8} $$

then we can write (3.7) simply as

$$ u_i(n+1) = \sum_{j=1}^{N} p_{ij}\, e^{-\gamma r_{ij}}\, u_j(n) \tag{3.9} $$

In the case of risk aversion (positive $\gamma$), the term
$e^{-\gamma r_{ij}}$ is the negative utility, or the "disutility," of the
reward $r_{ij}$. The term "disutility" will be retained regardless of the
sign of $\gamma$. If we define the "disutility contribution" of the
transition from $i$ to $j$ as

$$ q_{ij} = p_{ij}\, e^{-\gamma r_{ij}} \tag{3.10} $$

then we have a disutility contribution matrix $Q$ with elements $q_{ij}$
which are nonnegative. In this case, (3.9) becomes

$$ u_i(n+1) = \sum_{j=1}^{N} q_{ij}\, u_j(n) \tag{3.11} $$

This is the recursive relation for computing successive utilities of the
process. To find the certain equivalents, we refer to the definition of
the utilities given by (3.8) and use (3.4) to get

$$ \tilde{v}_i(n) = -\frac{1}{\gamma} \ln\left[ -(\operatorname{sgn}\gamma)\, u_i(n) \right] \tag{3.12} $$
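The recursion (3.11) and the conversion (3.12) can be sketched in a few lines. This is an illustration only: the two-state policy and its rewards are hypothetical, and a zero terminal reward is assumed, so that $u_i(0) = -(\operatorname{sgn}\gamma)$.

```python
import math

def disutility_matrix(P, R, gamma):
    """q_ij = p_ij * exp(-gamma * r_ij), as in (3.10)."""
    n = len(P)
    return [[P[i][j] * math.exp(-gamma * R[i][j]) for j in range(n)] for i in range(n)]

def step_utilities(Q, u):
    """One application of (3.11): u_i(n+1) = sum_j q_ij * u_j(n)."""
    return [sum(q * uj for q, uj in zip(row, u)) for row in Q]

def certain_equivalents(u, gamma):
    """(3.12): CE_i(n) = -(1/gamma) * ln(-(sgn gamma) * u_i(n))."""
    s = 1.0 if gamma > 0 else -1.0
    return [-math.log(-s * ui) / gamma for ui in u]

# Hypothetical two-state policy (numbers are illustrative only).
P = [[0.5, 0.5], [0.4, 0.6]]
R = [[10.0, 4.0], [8.0, 2.0]]
gamma = 0.1
Q = disutility_matrix(P, R, gamma)

u = [-1.0, -1.0]  # assumed terminal condition: u_i(0) = -(sgn gamma), i.e. CE of zero
history = [certain_equivalents(u, gamma)]
for _ in range(3):
    u = step_utilities(Q, u)
    history.append(certain_equivalents(u, gamma))

for n, ce in enumerate(history):
    print(n, ce)
```

For this risk-averse example the one-stage certain equivalent in the first state lies below the expected one-stage reward of 7, as one would expect.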

To see what happens when we allow the time horizon to be infinite,
we first write (3.11) in vector form. Defining $U(n)$ as the vector
whose components are $u_i(n)$, we can write (3.11) as

$$ U(n+1) = Q \cdot U(n) $$

This gives

$$ U(n) = Q^n \cdot U(0) \tag{3.13} $$

If the Markov transition probability matrix $P$ is irreducible and
acyclic,* then $Q$ is irreducible and primitive. In this case, it can be
shown that

$$ \lim_{n \to \infty} \left( \frac{1}{\lambda} \right)^{n} Q^n \cdot U(0) = \lim_{n \to \infty} \left( \frac{1}{\lambda} \right)^{n} U(n) = k \cdot U \tag{3.14} $$

where $\lambda$ is the largest eigenvalue of $Q$, and $U$ is the
corresponding eigenvector with $k$ chosen such that
$u_N = -(\operatorname{sgn}\gamma)$. Thus, for large $n$ the utility of
any state is multiplied by $\lambda$ at each successive stage.
Equations (3.12) and (3.14) may be used to show [6] that the asymptotic
form of the certain equivalent can be written as

$$ \tilde{v}_i(n) = n \tilde{g} + \tilde{v}_i + c \tag{3.15} $$

where the certain equivalent gain $\tilde{g}$ of the process is given by

$$ \tilde{g} = -\frac{1}{\gamma} \ln \lambda \tag{3.16} $$
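The limit (3.14) suggests one simple way of obtaining $\lambda$ numerically: repeated multiplication by $Q$ with rescaling (power iteration). The sketch below is one possible implementation under that assumption, not a method prescribed by the text; the function names and the example data are hypothetical.

```python
import math

def dominant_eigenpair(Q, iters=1000):
    """Power iteration for a primitive nonnegative Q.
    Returns (lam, x) with the eigenvector scaled so that x[-1] == 1."""
    n = len(Q)
    x = [1.0] * n
    lam = 0.0
    for _ in range(iters):
        y = [sum(Q[i][j] * x[j] for j in range(n)) for i in range(n)]
        lam = y[-1]                  # valid because x[-1] == 1 before the product
        x = [yi / lam for yi in y]
    return lam, x

def certain_equivalent_gain(P, R, gamma):
    """Gain (3.16) of a fixed policy, with utilities scaled so that u_N = -(sgn gamma)."""
    n = len(P)
    Q = [[P[i][j] * math.exp(-gamma * R[i][j]) for j in range(n)] for i in range(n)]
    lam, x = dominant_eigenpair(Q)
    s = 1.0 if gamma > 0 else -1.0
    U = [-s * xi for xi in x]
    return -math.log(lam) / gamma, lam, U

# Hypothetical two-state policy (illustrative numbers).
P = [[0.5, 0.5], [0.4, 0.6]]
R = [[10.0, 4.0], [8.0, 2.0]]
g, lam, U = certain_equivalent_gain(P, R, gamma=0.1)
print(g, lam, U)
```

A quick sanity check on such a routine is a policy whose rewards all equal the same constant $r$: then $Q = e^{-\gamma r} P$, so $\lambda = e^{-\gamma r}$ and the gain is $r$.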


* A reducible matrix $A$ is one for which a permutation exists to place
it in the form

$$ \begin{bmatrix} B & 0 \\ C & D \end{bmatrix} $$

where $B$ and $D$ are square matrices. Otherwise, $A$ is irreducible.
An irreducible $P$ is one in which all states communicate. A matrix $A$
is primitive if some power of $A$ has all elements positive. A primitive
$P$ is called acyclic.

$\tilde{g}$ is a property of the policy under consideration, and we will
be seeking that policy which maximizes it. To compute $\tilde{g}$ for a
given policy, we divide (3.11) by $\lambda^n$, let $n \to \infty$, and
then use (3.14) to get

$$ \sum_{j=1}^{N} q_{ij} u_j = \lambda u_i, \qquad i = 1, 2, \dots, N \tag{3.17} $$

(3.17) has to be solved for the largest eigenvalue $\lambda$ of $Q$, which
makes (3.17) of rank $N-1$, i.e., a redundant equation exists. This is
overcome by setting $u_N = -(\operatorname{sgn}\gamma)$, which makes
$\tilde{v}_N = 0$. However, choosing a value for $u_N$ is not completely
arbitrary. That value has to have the opposite sign of $\gamma$ (more
about this in Section C). As soon as we have $\lambda$, (3.16) gives us
$\tilde{g}$ for the policy under consideration. This phase of the
algorithm is called policy evaluation (PE). We need a policy improvement
(PI) phase. As in the risk-indifferent case, test quantities are defined,
and improving the policy reduces to maximizing the test quantities in each
state. The utilities replace the relative values, and the disutility
contributions replace the probabilities. In other words, we select in
each state $i$ the alternative $k$ that attains

$$ \max_{k = 1, \dots, K_i} \sum_{j=1}^{N} q_{ij}^{k} u_j \tag{3.18} $$

This is the equivalent of (2.57). The immediate expected rewards are
not explicitly present because they are included in the utilities. The
algorithm terminates when PI yields the same policy that we entered it
with.
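The PE and PI phases described above can be put together as follows. This is a minimal sketch of the unconstrained PE-PI loop, assuming power iteration for the eigenvalue problem (3.17) and a keep-the-current-alternative rule on ties; the data layout `alts[i][k] = (probabilities, rewards)` and all numbers are hypothetical.

```python
import math

def evaluate(policy, alts, gamma, iters=1000):
    """PE: largest eigenvalue of Q(policy) and utilities with u_N = -(sgn gamma)."""
    n = len(alts)
    Q = []
    for i in range(n):
        probs, rewards = alts[i][policy[i]]
        Q.append([p * math.exp(-gamma * r) for p, r in zip(probs, rewards)])
    x, lam = [1.0] * n, 1.0
    for _ in range(iters):  # power iteration; x[-1] stays equal to 1
        y = [sum(q * xj for q, xj in zip(row, x)) for row in Q]
        lam = y[-1]
        x = [yi / lam for yi in y]
    s = 1.0 if gamma > 0 else -1.0
    return lam, [-s * xi for xi in x]

def improve(policy, alts, u, gamma, tol=1e-12):
    """PI: per state, maximize sum_j q_ij^k u_j as in (3.18); keep the current choice on ties."""
    new_policy = []
    for i, state_alts in enumerate(alts):
        tests = [sum(p * math.exp(-gamma * r) * uj for p, r, uj in zip(probs, rewards, u))
                 for probs, rewards in state_alts]
        best = max(tests)
        keep = tests[policy[i]] >= best - tol
        new_policy.append(policy[i] if keep else tests.index(best))
    return new_policy

def pe_pi(alts, gamma):
    """Alternate PE and PI until PI returns the policy it was given."""
    policy = [0] * len(alts)
    while True:
        lam, u = evaluate(policy, alts, gamma)
        new_policy = improve(policy, alts, u, gamma)
        if new_policy == policy:
            return policy, -math.log(lam) / gamma
        policy = new_policy

# Hypothetical problem: two states, two alternatives each.
alts = [
    [([0.5, 0.5], [10.0, 4.0]), ([0.8, 0.2], [4.0, 6.0])],
    [([0.4, 0.6], [8.0, 2.0]), ([0.7, 0.3], [5.0, 1.0])],
]
best_policy, best_gain = pe_pi(alts, gamma=0.1)
print(best_policy, best_gain)
```

Because the utilities are given the sign opposite to $\gamma$ in `evaluate`, the same maximization in `improve` serves both the risk-averse and the risk-preferring case.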
C. Markov Decision Processes with Policy Constraints


From the discussion of the previous section, we can view the
risk-sensitive Markov Decision Process as follows.

Each policy $P$, made up of an alternative in each state, has
associated with it a disutility contribution matrix $Q$. The maximal
eigenvalue $\lambda^M$ of $Q$ determines the certain equivalent gain of
the policy, $\tilde{g} = -(1/\gamma) \ln \lambda^M$. Thus, we are
confronted with the task of selecting that $Q$ which yields the largest
value of $\tilde{g}$. The sign of $\gamma$ (type of attitude towards risk)
determines our objective. For a risk preferring decision maker, $\gamma$
is negative. Consequently, maximizing $\tilde{g}$ is the same as
maximizing $\lambda^M$. For a risk averse decision maker, however,
$\gamma$ is positive, and it is the smallest $\lambda^M$ which gives the
largest $\tilde{g}$.

We have thus ascertained that our objective function is $\lambda^M$, the
maximal eigenvalue of $Q$, whence our constraints are those defining the
eigenvalue problem that yields $\lambda^M$. To gain more insight into
this, we reconsider the risk-indifferent case. If we write the
constraints defining the limiting state probabilities $\Pi$ for a
transition probability matrix $P$ in vector form, we get

$$ \Pi^T \cdot P = \Pi^T \tag{3.19} $$

It is then apparent that $\Pi$ is a "left eigenvector" of $P$ (or
eigenvector of its transpose) with a corresponding eigenvalue of unity.
But that eigenvalue is the maximal eigenvalue of $P$ by virtue of it being
a stochastic matrix. This holds for any policy. Therefore, in the
risk-indifferent case, all maximal eigenvalues are unity. The
corresponding left eigenvectors then define the objective function through

$$ g = \sum_{i=1}^{N} \pi_i q_i $$

with $q_i$ the expected immediate reward in state $i$.

In the risk-sensitive case, we deal with the $Q$ matrices. Their
maximal eigenvalues differ, and solving the eigenvalue problem yields
the objective function through the eigenvalue rather than the
eigenvector. To get the constraints, then, we define a vector $Z$, with
components $z_i$, as the left eigenvector of $Q$. Thus, $Z$ is the
counterpart of $\Pi$ (more about that later), and it should satisfy

$$ Z^T \cdot Q = \lambda^M \cdot Z^T \tag{3.20} $$
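The correspondence between $Z$ and $\Pi$ can be seen numerically. The sketch below computes a left eigenvector by power iteration on the transpose, an assumed method chosen for brevity. Setting $\gamma = 0$ in (3.10) is used here purely as a limiting check: it makes $Q = P$, so the left eigenvector should reduce to the limiting state probabilities. The example matrices are hypothetical.

```python
import math

def left_eigenpair(Q, iters=2000):
    """Power iteration on the transpose: returns (lam, Z) with Z^T Q = lam Z^T and sum(Z) == 1."""
    n = len(Q)
    z = [1.0 / n] * n
    lam = 1.0
    for _ in range(iters):
        w = [sum(z[i] * Q[i][j] for i in range(n)) for j in range(n)]
        lam = sum(w)              # valid because sum(z) == 1 before the product
        z = [wj / lam for wj in w]
    return lam, z

def disutility_matrix(P, R, gamma):
    """q_ij = p_ij * exp(-gamma * r_ij), as in (3.10)."""
    return [[p * math.exp(-gamma * r) for p, r in zip(pr, rr)] for pr, rr in zip(P, R)]

P = [[0.5, 0.5], [0.4, 0.6]]   # hypothetical transition matrix
R = [[10.0, 4.0], [8.0, 2.0]]  # hypothetical rewards

lam0, pi = left_eigenpair(disutility_matrix(P, R, 0.0))  # gamma = 0 gives Q = P
lam, Z = left_eigenpair(disutility_matrix(P, R, 0.1))    # a risk-averse case
print(lam0, pi)
print(lam, Z)
```

The scaling `sum(Z) == 1` is chosen only so that the $\gamma = 0$ case is directly comparable with $\Pi$; the text fixes the scale of $Z$ differently, through (3.30).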


(3.20) is the counterpart of (3.19), i.e., for a given policy. To take
alternative selection into account, we use the $d_i^k$ on the rows of $Q$,
as we did in the risk-indifferent case. We will first concentrate on
the risk preferring case, i.e., $\gamma < 0$. Here, we are dealing with
the constrained maximization problem:

$$ \max \lambda^M \tag{3.21} $$

subject to

$$ \sum_{i=1}^{N} \sum_{k=1}^{K_i} z_i q_{ij}^{k} d_i^{k} - \lambda^M z_j = 0, \qquad j = 1, 2, \dots, N \tag{3.22} $$

$$ \sum_{k=1}^{K_i} d_i^{k} - 1 = 0, \qquad i = 1, 2, \dots, N \tag{3.23} $$

$$ f_l(d_i^k) \le 0, \quad l = 1, 2, \dots, m, \quad (i,k) \in S_1; \qquad h_p(d_i^k) = 0, \quad p = 1, 2, \dots, \quad (i,k) \in S_2 \tag{3.24} $$

where $\lambda^M$ is the largest positive number satisfying (3.22), and
$S_1$ and $S_2$ are subsets of $S = \{(i,k)\}$, the set of all $(i,k)$
pairs defining each alternative in each state.

The constraints (3.23) and (3.24) are those we previously encountered
as (2.37) and (2.46). The constraints (3.22) are not linear,
whence there is no LP equivalent of (3.21) through (3.24). Consequently,
we cannot apply the LP result which guarantees that the $d_i^k$ will turn
out to have zero-unity values. Rather, when solving the problem, we will
restrict ourselves to those values. While other values satisfying (3.23)
might give larger values for $\lambda^M$, our objective function, we
reject them. The reason is that they represent "randomized strategies," a
concept which is meaningless in our context.
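Restricting the $d_i^k$ to zero-unity values turns the problem into a search over a finite set of policies. For very small problems that set can simply be enumerated, which gives a reference answer against which any cleverer procedure can be checked. The sketch below is such an exhaustive baseline and is not the algorithm developed in this chapter; the feasibility predicate, the data and the function names are hypothetical.

```python
import itertools
import math

def gain(policy, alts, gamma, iters=1000):
    """Certain equivalent gain -(1/gamma) ln(lambda^M) of one deterministic policy."""
    Q = []
    for i, k in enumerate(policy):
        probs, rewards = alts[i][k]
        Q.append([p * math.exp(-gamma * r) for p, r in zip(probs, rewards)])
    x, lam = [1.0] * len(Q), 1.0
    for _ in range(iters):  # power iteration for the maximal eigenvalue
        y = [sum(q * xj for q, xj in zip(row, x)) for row in Q]
        lam = y[-1]
        x = [yi / lam for yi in y]
    return -math.log(lam) / gamma

def best_feasible_policy(alts, gamma, is_feasible):
    """Exhaustive search over zero-unity policies that satisfy the constraint predicate."""
    candidates = itertools.product(*(range(len(a)) for a in alts))
    feasible = [p for p in candidates if is_feasible(p)]
    return max(feasible, key=lambda p: gain(p, alts, gamma))

# Hypothetical problem: alts[i][k] = (transition probabilities, rewards).
alts = [
    [([0.5, 0.5], [10.0, 4.0]), ([0.8, 0.2], [4.0, 6.0])],
    [([0.4, 0.6], [8.0, 2.0]), ([0.7, 0.3], [5.0, 1.0])],
]

# Hypothetical coupling: the second alternative may not be used in both states at once.
def is_feasible(policy):
    return not (policy[0] == 1 and policy[1] == 1)

policy = best_feasible_policy(alts, 0.1, is_feasible)
print(policy, gain(policy, alts, 0.1))
```

The number of candidate policies is the product of the $K_i$, so this enumeration is practical only as a check on small examples.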

As noted before, a set of $d_i^k$ describing a policy $P$ determines a
disutility contribution matrix $Q(P)$ and its associated left
eigenvector $Z$ defined by

$$ Z^T \cdot Q(P) = \lambda^M \cdot Z^T \tag{3.25} $$

For an irreducible primitive matrix $Q$, the components of the maximal
eigenvector all have the same sign. They are either all positive or all
negative. (The transpose of an irreducible primitive matrix is also
irreducible and primitive.)

As in Chapter II, we first concentrate on (3.21) through (3.23),
i.e., unconstrained policies. We showed that the maximizer of the
constrained problem is a critical point of the corresponding Lagrangian.
Here, we have $2N$ Lagrange multipliers corresponding to (3.22) and
(3.23). The first $N$ of those, associated with (3.22), we denote by
$u_1, u_2, \dots, u_N$. The other $N$ ones, we denote by $\beta_i$.
Consequently, our Lagrangian is

$$ L = \lambda^M + \sum_{j=1}^{N} u_j \left[ \sum_{i=1}^{N} \sum_{k=1}^{K_i} z_i q_{ij}^{k} d_i^{k} - \lambda^M z_j \right] + \sum_{i=1}^{N} \beta_i \left[ \sum_{k=1}^{K_i} d_i^{k} - 1 \right] \tag{3.26} $$

To maximize our constrained function, we seek critical points of $L$. As
before, we only set certain partial derivatives to zero to evaluate a
policy, then we try to change the policy so as to maximize $L$ (because
the values of the constrained function and $L$ are identical whenever a
policy is evaluated).

Setting the partial derivatives of $L$ w.r.t. the $z_i$ and $\lambda^M$
equal to zero, we get

$$ \frac{\partial L}{\partial z_i} = \sum_{j=1}^{N} \sum_{k=1}^{K_i} u_j q_{ij}^{k} d_i^{k} - \lambda^M u_i = 0, \qquad i = 1, 2, \dots, N \tag{3.27} $$

$$ \frac{\partial L}{\partial \lambda^M} = 1 - \sum_{j=1}^{N} z_j u_j = 0 \tag{3.28} $$

For a given policy $P$, the $d_i^k$ are all zero or unity with only one
$d_i^k$ equal to unity in each state. Thus, in (3.27), the summation over
$k$ reduces to one element for each $i$. This defines the $Q(P)$ of the
policy under consideration. Defining $U$ as the vector whose components
are $u_j$, (3.27) and (3.28) can be written as

$$ Q(P) \cdot U = \lambda^M U \tag{3.29} $$

$$ Z^T \cdot U = 1 \tag{3.30} $$

Differentiating $L$ w.r.t. the $u_j$ and equating to zero gives us the
original constraints (3.25)

$$ Z^T \cdot Q(P) = \lambda^M \cdot Z^T \tag{3.25} $$

The system (3.29) is the PE phase of the PE-PI algorithm. It is an
eigenvalue problem, whence the rank of the system is $N-1$. At this
point, there is nothing to indicate that setting $u_N$ to some particular
value has any significance. We know, however, that $U$, as well as $Z$,
has components which are either all positive or all negative. Equation
(3.30) tells us that $Z$ and $U$ have the same signs. This will become
significant in the PI phase. What concerns us here is the relationship
of the solutions of (3.25) and (3.29).

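In matrix form, (3.29) can be written as

$$
\begin{bmatrix}
q_{11} - \lambda^M & q_{12} & \cdots & q_{1N} \\
q_{21} & q_{22} - \lambda^M & \cdots & q_{2N} \\
\vdots & \vdots & & \vdots \\
q_{N1} & q_{N2} & \cdots & q_{NN} - \lambda^M
\end{bmatrix}
\begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_N \end{bmatrix}
=
\begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}
$$

This, as we noted earlier, is a system of rank $N-1$. If we set
$u_N = a$, say, we get

$$
\begin{bmatrix}
q_{11} - \lambda^M & q_{12} & \cdots & q_{1,N-1} \\
q_{21} & q_{22} - \lambda^M & \cdots & q_{2,N-1} \\
\vdots & \vdots & & \vdots \\
q_{N-1,1} & q_{N-1,2} & \cdots & q_{N-1,N-1} - \lambda^M
\end{bmatrix}
\begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_{N-1} \end{bmatrix}
= -a
\begin{bmatrix} q_{1N} \\ q_{2N} \\ \vdots \\ q_{N-1,N} \end{bmatrix}
\tag{3.31}
$$

where we dropped the last equation. The resulting system can be written
as

$$ A \cdot \underline{U} = V $$

where $A$ is the matrix on the L.H.S. of (3.31), and $V$ is the vector on
the R.H.S. $\underline{U}$ is the $(N-1)$-vector composed of the first
$N-1$ components of $U$. Hence,

$$ \underline{U} = A^{-1} \cdot V \tag{3.32} $$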

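Now we write (3.25) in matrix form:

$$
\begin{bmatrix}
q_{11} - \lambda^M & q_{21} & \cdots & q_{N1} \\
q_{12} & q_{22} - \lambda^M & \cdots & q_{N2} \\
\vdots & \vdots & & \vdots \\
q_{1N} & q_{2N} & \cdots & q_{NN} - \lambda^M
\end{bmatrix}
\begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_N \end{bmatrix}
=
\begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}
$$

This being a system of rank $N-1$, we can set $z_N = b$, say, to get

$$
\begin{bmatrix}
q_{11} - \lambda^M & q_{21} & \cdots & q_{N-1,1} \\
q_{12} & q_{22} - \lambda^M & \cdots & q_{N-1,2} \\
\vdots & \vdots & & \vdots \\
q_{1,N-1} & q_{2,N-1} & \cdots & q_{N-1,N-1} - \lambda^M
\end{bmatrix}
\begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_{N-1} \end{bmatrix}
= -b
\begin{bmatrix} q_{N1} \\ q_{N2} \\ \vdots \\ q_{N,N-1} \end{bmatrix}
\tag{3.33}
$$

Denoting the $(N-1)$-vector whose components are $z_1, \dots, z_{N-1}$ by
$\underline{Z}$ and the R.H.S. vector of (3.33) by $W$, we get

$$ A^T \cdot \underline{Z} = W $$

where $A$ is the same matrix as in (3.31). Thus,

$$ \underline{Z} = (A^T)^{-1} \cdot W = (A^{-1})^T \cdot W \tag{3.34} $$

And thus solving (3.32) implies that (3.34) is solved. All we need to
get $\underline{Z}$, once we have $\underline{U}$, is to transpose the
inverse we already have and multiply it by the transpose of the last row
of $Q$ (its first $N-1$ elements) and by $-z_N$. This is significant for
the case of policy constraints.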
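The reuse of one inverse for both eigenvectors, as in (3.32) and (3.34), can be sketched as follows. The sketch assumes that $\lambda^M$ has already been obtained (here by power iteration) and uses a plain Gauss-Jordan inverse; the $3 \times 3$ matrix and all function names are hypothetical.

```python
def inverse(M):
    """Gauss-Jordan inverse with partial pivoting (M must be nonsingular)."""
    n = len(M)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(M)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        pv = aug[c][c]
        aug[c] = [v / pv for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def max_eigenvalue(Q, iters=1000):
    """Power iteration for the maximal eigenvalue of a primitive nonnegative Q."""
    x, lam = [1.0] * len(Q), 1.0
    for _ in range(iters):
        y = [sum(q * xj for q, xj in zip(row, x)) for row in Q]
        lam = y[-1]
        x = [yi / lam for yi in y]
    return lam

def right_and_left(Q, a, b):
    """Solve (3.32) for U with u_N = a, then reuse the same inverse, transposed, for Z with z_N = b."""
    n = len(Q)
    lam = max_eigenvalue(Q)
    A = [[Q[i][j] - (lam if i == j else 0.0) for j in range(n - 1)] for i in range(n - 1)]
    A_inv = inverse(A)
    V = [-a * Q[i][n - 1] for i in range(n - 1)]   # R.H.S. of (3.31)
    W = [-b * Q[n - 1][j] for j in range(n - 1)]   # R.H.S. of (3.33)
    U = [sum(A_inv[i][j] * V[j] for j in range(n - 1)) for i in range(n - 1)] + [a]
    Z = [sum(A_inv[j][i] * W[j] for j in range(n - 1)) for i in range(n - 1)] + [b]
    return lam, U, Z

# Hypothetical 3-state disutility contribution matrix (all entries positive).
Q = [[0.20, 0.30, 0.10],
     [0.25, 0.15, 0.35],
     [0.10, 0.40, 0.30]]
lam, U, Z = right_and_left(Q, a=-1.0, b=-1.0)
print(lam, U, Z)
```

Only one $(N-1) \times (N-1)$ inversion is performed; the left eigenvector costs one extra matrix-vector product, which is the point made in the text.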

Now we interpret the Lagrange multipliers $u_j$ and the $z_i$. The $Z$
plays here the role that the $\Pi$ plays in the risk-insensitive case.
Note that the constraints defining the limiting state probabilities are

$$ \Pi^T \cdot P = \Pi^T $$

$$ \Pi^T \cdot \mathbf{1} = 1 $$

where $\mathbf{1}$ is the vector whose components are all unity. That
vector is the eigenvector of $P$ corresponding to $\Pi$:

$$ P \cdot \mathbf{1} = \mathbf{1} $$

Moreover, the common eigenvalue is unity, the maximal eigenvalue for any
transition probability matrix $P$. Hence, (3.25) represents the
"equilibrium of disutility contribution flows" in the limiting case,
exactly like the constraints on the limiting state probabilities did for
the probabilistic flows. The difference is that here the "outflow" is
multiplied by $\lambda^M$, corresponding to the fact that, in the limit,
utilities are multiplied by $\lambda^M$ every transition. The $u_j$, being
Lagrange multipliers, give us the cost of violating the constraints per
unit of disequilibrium, just as the $v$'s did in the risk-insensitive
case.

Now we proceed to PI. Once we have evaluated a policy, we want to
improve it. We make maximizing the Lagrangian our objective (providing
the guarantee of improvement outside the Lagrangian framework, as
before). To do that, we rewrite $L$ to bring out the dependence on
$d_i^k$:

$$ L = \sum_{i=1}^{N} z_i \sum_{j=1}^{N} u_j \sum_{k=1}^{K_i} q_{ij}^{k} d_i^{k} + \lambda^M \left[ 1 - \sum_{j=1}^{N} z_j u_j \right] + \sum_{i=1}^{N} \beta_i \left[ \sum_{k=1}^{K_i} d_i^{k} - 1 \right] $$

For the given policy $\sum_{j=1}^{N} z_j u_j = 1$ (from (3.30)), and for
any policy $\sum_{k=1}^{K_i} d_i^k = 1$, whence we only have to contend
with the first term:

$$ \sum_{i=1}^{N} z_i \sum_{j=1}^{N} u_j \sum_{k=1}^{K_i} q_{ij}^{k} d_i^{k} \tag{3.35} $$

(3.35) is the inner product of two vectors. The first vector $Z$
is constant. The second is policy dependent. Once a policy $P$ is
selected, it defines a vector $T(P)$ of "test quantities"

$$ t_i^k = \sum_{j=1}^{N} q_{ij}^{k} u_j \tag{3.36} $$

where the $q_{ij}^k$ is the corresponding element of the $Q$ selected by
policy $P$. The sign of $t_i^k$ is the same as that of $U$, and hence
also $Z$. Thus, if $Z$ and $U$ are chosen positive
($u_N = -(\operatorname{sgn}\gamma)$, because we are dealing with the case
$\gamma < 0$), then, to maximize $L$, we have to select that feasible
$T(P)$ with largest components. Otherwise, we would have to select that
one with the smallest components (if $Z$ and $U$ are negative). In other
words, if the sign of $u_N$ is the same as that of $\gamma$, the test
quantity has to be minimized in order to maximize the Lagrangian. We will
assume hereafter that in PE we take
$u_N = z_N = -(\operatorname{sgn}\gamma)$.

In the absence of policy constraints, we can select

$$ t_i = \max_{k = 1, 2, \dots, K_i} \sum_{j=1}^{N} q_{ij}^{k} u_j \tag{3.37} $$

Since the $z_i$ are nonnegative, this choice maximizes the Lagrangian,
and we have the PI phase of Section B. If we have policy constraints,
however, such a policy might not be feasible. In this case, we have to
consider (3.35) as a whole. In essence, we redefine the components of
$T$ to be

$$ t_i = z_i \sum_{j=1}^{N} q_{ij}^{k} u_j \tag{3.38} $$

where the policy $P$ determines the alternative in each state

$$ P(i) = k $$

The Lagrangian then becomes merely the sum of the components of
$T(P)$. Here, the original test quantities have been weighted by the
corresponding $z_i$, the variables controlling the equilibrium of
disutility contribution flows. Whereas, in the risk-insensitive case,
weighting test quantities by the limiting state probabilities was
intuitive (the process spends $\pi_i$ of the time in state $i$, on the
average), this weighting cannot be intuitively derived in the
risk-sensitive case. $Z$ encodes the limiting behavior from both the
probabilistic and risk attitude aspects.

Note that, if the test quantities are defined by (3.38), i.e., we
multiply by $z_i$, then, when maximizing the Lagrangian, we do not have
to worry about the signs of $Z$ and $U$. They cancel out. Thus, in the
absence of policy constraints, we can set
$u_N = (\operatorname{sgn}\gamma) = -1$ but take care to multiply the test
quantities by $-1$ before maximizing in each state. This is just another
way of saying that $u_N$ has to be positive when $\gamma$ is negative.

In other words, (3.29) is not really a "free" eigenvalue problem,
in the sense that we are not free to choose any value for one of the
$u_j$. The values have to be of different sign than $\gamma$ in the
absence of policy constraints in order to implement the PI of Section B
as is. Otherwise, the test quantity has to be minimized. It is the
realization that PI is maximization of the Lagrangian which led us to
detect this dependence on $\operatorname{sgn}\gamma$. As mentioned, this
dependence can be removed if the test quantities are multiplied by the
$z_i$. Since this is what we do when policy constraints are present, we
need not worry about signs. We will select
$z_N = u_N = -(\operatorname{sgn}\gamma)$ for consistency. (Either $Z$ or
$U$ has then to be normalized to satisfy (3.30).) For constrained
policies, we maximize the Lagrangian by BB as before. That BB does
maximize the Lagrangian has already been proved. The proof still applies
because the Lagrangian was only assumed to be the sum of the components of
$T(P)$ (which it still is) without any dependence on how those components
were obtained.

For the case of risk aversion, positive $\gamma$, we previously
mentioned that we need to minimize $\lambda^M$ in order to maximize
$\tilde{g}$. In this case, our constrained problem is the same as (3.21)
through (3.24), except that (3.21) is replaced by $\min \lambda^M$. What
applies to constrained maximization applies to constrained minimization,
as far as the Lagrange multiplier rule is concerned.

Thus, the minimizer of the constrained problem is still a critical
point of the Lagrangian. Hence, the PE phase is the same. When we get
down to PI, however, we would like to select $P$ so as to minimize the
Lagrangian. In the absence of policy constraints, setting
$u_N = -(\operatorname{sgn}\gamma)$ makes $U$ and $Z$ both negative. Thus,
to minimize the inner product, the variable vector composed of test
quantities has to be maximized (because it will be multiplied by a
negative vector, and we want to minimize the result). Hence, setting
$u_N = -(\operatorname{sgn}\gamma)$ in the unconstrained policies case,
and then maximizing the resultant test quantities, actually minimizes the
Lagrangian. This is what is really sought here. For constrained policies,
we merely minimize the Lagrangian without worrying about signs, because
we multiply the test quantities by $z_i$. A better approach would be to
take the sign of $\gamma$ into consideration explicitly. This could be
done by redefining the test quantities as

$$ t_i^k = -(\operatorname{sgn}\gamma)\, z_i \sum_{j=1}^{N} q_{ij}^{k} u_j \tag{3.39} $$

For a risk preferrer ($\gamma < 0$), this reduces to (3.38), whence all
the previous applies. For $\gamma > 0$, (3.39) effectively multiplies the
Lagrangian by $-1$. In other words, the Lagrangian defined by (3.39) is
the negative of that defined by (3.38). Since we want to minimize the
latter, we can maximize the former. Consequently, PI becomes
maximization, irrespective of the sign of $\gamma$, if we define the test
quantities by (3.39).

To summarize then, setting $u_N = -(\operatorname{sgn}\gamma)$ makes PI
maximize test quantities in the absence of policy constraints. While this
involves dependence on the sign of $\gamma$, it automatically takes into
account the fact that in one case we want to maximize $L$ and in the other
minimize it. When policy constraints are present, we have to explicitly
take the sign of $\gamma$ into account by defining the test quantities as
in (3.39), whence we always maximize $L$ in the PI.
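The statement that the Lagrangian is the sum of the weighted test quantities can be checked numerically: for the evaluated policy itself, with $Z$ normalized so that (3.30) holds, the quantities (3.39) should add up to $-(\operatorname{sgn}\gamma)\lambda^M$. The sketch below assumes power iteration for both eigenvectors; the policy data and function names are hypothetical.

```python
import math

def eigen_triple(Q, gamma, iters=1000):
    """lambda^M with right vector U (u_N = -(sgn gamma)) and left vector Z scaled so Z^T U = 1."""
    n = len(Q)
    x, z, lam = [1.0] * n, [1.0] * n, 1.0
    for _ in range(iters):
        y = [sum(Q[i][j] * x[j] for j in range(n)) for i in range(n)]
        w = [sum(z[i] * Q[i][j] for i in range(n)) for j in range(n)]
        lam = y[-1]
        x = [yi / y[-1] for yi in y]
        z = [wj / w[-1] for wj in w]
    s = 1.0 if gamma > 0 else -1.0
    U = [-s * xi for xi in x]
    dot = sum(zi * ui for zi, ui in zip(z, U))
    Z = [zi / dot for zi in z]  # normalize Z to satisfy (3.30)
    return lam, U, Z

def weighted_tests(Q_rows, U, Z, gamma):
    """Test quantities (3.39): t_i = -(sgn gamma) z_i sum_j q_ij u_j, one row of Q per state."""
    s = 1.0 if gamma > 0 else -1.0
    return [-s * zi * sum(q * uj for q, uj in zip(row, U)) for zi, row in zip(Z, Q_rows)]

# Hypothetical two-state policy.
P = [[0.5, 0.5], [0.4, 0.6]]
R = [[10.0, 4.0], [8.0, 2.0]]
gamma = 0.1
Q = [[p * math.exp(-gamma * r) for p, r in zip(pr, rr)] for pr, rr in zip(P, R)]
lam, U, Z = eigen_triple(Q, gamma)
t = weighted_tests(Q, U, Z, gamma)
print(sum(t), -lam)  # equal up to rounding for gamma > 0
```

For $\gamma > 0$ the sum equals $-\lambda^M$, so maximizing it minimizes $\lambda^M$; for $\gamma < 0$ it equals $\lambda^M$ itself. In both cases a larger sum means a larger certain equivalent gain.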

D. Development and Convergence of the Algorithm

As in Chapter II, the maximization of $L$ does not, per se, guarantee
an improvement in $\tilde{g}$. We have to provide that guarantee outside
the Lagrangian framework. We will introduce a sufficient condition for
improving $\tilde{g}$, based on a condition derived by Howard and Matheson
[6]. If we have a policy $A$, then the test quantity corresponding to the
alternative selected by any other policy $B$ in state $i$ is defined as

$$ t_i(B) = -(\operatorname{sgn}\gamma)\, z_i^A \sum_{j=1}^{N} q_{ij}^{B} u_j^{A} \tag{3.40} $$

where $z_i^A$ and $u_j^A$ are the values obtained by the PE for policy
$A$. We will define $\Delta_i$'s analogous to the $\gamma_i$'s of the
risk-indifferent case

$$ \Delta_i(B,A) = t_i(B) - t_i(A) \tag{3.41} $$

where the $t_i$ are defined by (3.40).

Proposition 3.1.

Given a policy $A$ for which PE has been performed and any other
policy $B$, a sufficient condition for $\tilde{g}^B > \tilde{g}^A$ is

$$ \Delta_i(B,A) \ge 0, \qquad i = 1, 2, \dots, N \tag{3.42} $$

with strict inequality holding for at least one state $i$.

Proof.

We note that since $\tilde{g} = -(1/\gamma) \ln \lambda$, where $\lambda$
is the maximal eigenvalue of $Q$, then we need to prove that
$\lambda^B > \lambda^A$ for $\gamma < 0$ and vice versa.

An important result from matrix theory is that, if $Q$ is a nonnegative
irreducible matrix with maximal eigenvalue $\lambda$ and $x$ is a vector
with positive components $x_i$, then

$$ \min_i \frac{\sum_{j=1}^{N} q_{ij} x_j}{x_i} \le \lambda \le \max_i \frac{\sum_{j=1}^{N} q_{ij} x_j}{x_i} \tag{3.43} $$

with equality holding if and only if $x$ is an eigenvector of $Q$.
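The bound (3.43) is easy to illustrate numerically. The sketch below evaluates both sides for an arbitrary positive trial vector and for the eigenvector itself, where the two bounds should collapse onto $\lambda$. The matrix is hypothetical and the eigenpair is obtained by power iteration, an assumed choice.

```python
def ratio_bounds(Q, x):
    """Lower and upper bounds of (3.43) for a vector x with positive components."""
    ratios = [sum(q * xj for q, xj in zip(row, x)) / xi for row, xi in zip(Q, x)]
    return min(ratios), max(ratios)

def max_eigenpair(Q, iters=1000):
    """Power iteration; returns (lam, v) with v positive and v[-1] == 1."""
    v, lam = [1.0] * len(Q), 1.0
    for _ in range(iters):
        y = [sum(q * vj for q, vj in zip(row, v)) for row in Q]
        lam = y[-1]
        v = [yi / lam for yi in y]
    return lam, v

# Hypothetical nonnegative irreducible matrix.
Q = [[0.20, 0.30, 0.10],
     [0.25, 0.15, 0.35],
     [0.10, 0.40, 0.30]]
lam, v = max_eigenpair(Q)
lo, hi = ratio_bounds(Q, [1.0, 2.0, 3.0])  # arbitrary positive trial vector
lo_v, hi_v = ratio_bounds(Q, v)            # bounds collapse at the eigenvector
print(lo, lam, hi)
print(lo_v, hi_v)
```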

Now we consider the implications of (3.42). By virtue of (3.41),
we can write

$$ t_i(B) \ge t_i(A) $$

with strict inequality holding in at least one state. Using (3.40), this
reduces to

$$ -(\operatorname{sgn}\gamma)\, z_i^A \sum_{j=1}^{N} q_{ij}^B u_j^A \;\ge\; -(\operatorname{sgn}\gamma)\, z_i^A \sum_{j=1}^{N} q_{ij}^A u_j^A $$

Since, from PE, $z_i$ and $\gamma$ have different signs, the product
$-(\operatorname{sgn}\gamma)\, z_i$ is always positive, and we can divide both sides of the
inequality by that quantity without reversing its sense. Also from PE,
$U^A$ is an eigenvector of $Q^A$, whence

$$ \sum_{j=1}^{N} q_{ij}^B u_j^A \ge \lambda^A u_i^A \tag{3.44} $$

with strict inequality holding in at least one state i. If $U^A$ is also an
eigenvector of $Q^B$, (3.44) reduces to

$$ \lambda^B u_i^A \ge \lambda^A u_i^A $$

with

$$ \lambda^B u_i^A > \lambda^A u_i^A \quad \text{for some } i $$

For $\gamma < 0$, $u_i^A > 0$ and $\lambda^B > \lambda^A$.

For $\gamma > 0$, $u_i^A < 0$ and $\lambda^B < \lambda^A$.

If $U^A$ is not an eigenvector of $Q^B$, we consider $\gamma < 0$ first, then
$\gamma > 0$.

For $\gamma < 0$, $u_i^A > 0$, and (3.44) can be rewritten as

$$ \frac{\sum_{j=1}^{N} q_{ij}^B u_j^A}{u_i^A} \ge \lambda^A $$

Since this holds for all i, then it is certainly true that

$$ \min_i \frac{\sum_{j=1}^{N} q_{ij}^B u_j^A}{u_i^A} \ge \lambda^A $$

Applying (3.43) to the L.H.S. of this inequality, noting that $U^A$
is not an eigenvector of $Q^B$, we get

$$ \lambda^B > \lambda^A $$

For $\gamma > 0$, $u_i^A < 0$ and (3.44) can be rewritten as

$$ \sum_{j=1}^{N} q_{ij}^B \,|u_j^A| \le \lambda^A \,|u_i^A| $$

Hence,

$$ \frac{\sum_{j=1}^{N} q_{ij}^B \,|u_j^A|}{|u_i^A|} \le \lambda^A $$

Since this holds for all i, it is certainly true that

$$ \max_i \frac{\sum_{j=1}^{N} q_{ij}^B \,|u_j^A|}{|u_i^A|} \le \lambda^A $$

Applying (3.43) to the L.H.S. with $U^A$ not an eigenvector of $Q^B$ gives

$$ \lambda^B < \lambda^A $$
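The bound (3.43) that drives the proof is easy to check numerically. The sketch below is illustrative only: the matrix, the vector and the helper names are hypothetical, and the maximal eigenvalue is obtained by plain power iteration rather than by the PE of the text.

```python
import math


def ratio_bounds(Q, x):
    """Lower and upper bounds of (3.43) for a positive vector x."""
    ratios = [sum(q * x_j for q, x_j in zip(row, x)) / x_i
              for row, x_i in zip(Q, x)]
    return min(ratios), max(ratios)


def maximal_eigenvalue(Q, iterations=200):
    """Power iteration; adequate for a small positive matrix."""
    x = [1.0] * len(Q)
    lam = 0.0
    for _ in range(iterations):
        y = [sum(q * x_j for q, x_j in zip(row, x)) for row in Q]
        lam = max(y)                 # x is scaled so that max(x) == 1
        x = [v / lam for v in y]
    return lam


def certain_equivalent_gain(gamma, lam):
    """g = -(1/gamma) * ln(lambda)."""
    return -math.log(lam) / gamma


# Hypothetical data: u_A is an eigenvector of some Q_A with eigenvalue 0.9,
# but it is not an eigenvector of Q_B.
gamma = -0.5
lam_A = 0.9
u_A = [1.0, 2.0]
Q_B = [[0.6, 0.2], [0.6, 0.6]]

lo, hi = ratio_bounds(Q_B, u_A)
lam_B = maximal_eigenvalue(Q_B)
print(lo, lam_B, hi)
print(certain_equivalent_gain(gamma, lam_A), certain_equivalent_gain(gamma, lam_B))
```

With a negative risk coefficient, the larger eigenvalue of the second matrix translates into the larger certain equivalent gain, which is the direction the proposition asserts.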


We thus have a sufficient condition for improving g. The foregoing
proposition states, in effect, that "improving" the test quantity,
as defined by (3.40), in at least one state suffices to improve g.
Thus, when we enter BB, we will discard alternatives for which $\Delta_i < 0$.
If BB yields a policy different from the one we entered with, it is
automatically an improvement. Otherwise, the policy on which BB converges
maximizes g over the set of feasible policies that differ from it in
exactly one "coupled" state.

The previous proposition also guarantees that if, for a policy A,
each test quantity is maximum in its state, then policy A is optimum.
The proof is identical, except that equality is allowed to hold in all
states. Finally, to get an initial feasible policy, we maximize
$-(\operatorname{sgn}\gamma)\sum_{j=1}^{N} q_{ij}$ over the feasible policy set by BB. Thus, the
algorithm can be schematically represented in Figure 3.1.

The convergence proof is identical to that of Chapter II. If we
exit the algorithm at E2, that policy is the overall optimum. No policy
can have a better g. (This is the policy PE-PI would converge to. In
other words, the policy constraints have not altered the optimum policy.)

If we exit the algorithm at E3, then every feasible policy has become
infeasible by virtue of (2.71) type constraints. This means that it
belongs to a subset over which we maximized g, whence it cannot have a
better g than the one we obtained.

Fig. 3.1. ALGORITHM FOR RISK-SENSITIVE CASE.


E. The Example

We will solve the same example we solved in Chapter II, introducing
a risk-aversion coefficient $\gamma = 0.01$. We rewrite the original policy
constraints (2.72) and (2.73) as

$$ d_1^1 - d_2^1 = 0 \tag{3.45} $$

$$ d_1^2 + d_2^2 \le 1 \tag{3.46} $$

To get an initial feasible policy, we enter BB to maximize
$-\sum_{j=1}^{N} q_{ij}$ over the feasible policy set. The BB tree:

$T_{01} = -(0.923, 0.852, 0.933)$    $P_{01} = (1,1,1)$


Performing PE for the feasible policy (1,1,1) results in:

State   Ordered Test Quantities

1       $t_1^1 = -0.362$    $t_1^2 = -0.369$    $t_1^3 = -0.380$
2       $t_2^2 = -0.173$    $t_2^1 = -0.181$
3       -0.367    -0.369    -0.381

A = (1,1,1)    g(A) = 9.19


Regular PI gives (1,2,2), an infeasible policy. Maximizing the free
state gives P(3) = 2, and the BB tree is:

$T_{01} = -(0.362, 0.173)$    $P_{01} = (1,2)$

$T_{11} = -(0.362, 0.181)$    $P_{11} = (1,1)$

Thus, we get policy B = (1,1,2), different from the one we started with.


Performing PE for (1,1,2) gives:

State   Ordered Test Quantities

1       $t_1^1 = -0.357$    $t_1^2 = -0.364$    $t_1^3 = -0.374$
2       $t_2^2 = -0.273$    $t_2^1 = -0.286$
3       -0.268    -0.269    -0.278

A = (1,1,2)    g(A) = 9.34


Regular PI yields (1,2,2), which violates (3.45), i.e., is infeasible.
Hence, we maximize in 3 to get P(3) = 2, and we enter BB with
$\Delta_i \ge 0$, i.e.,

[Table: State / Ordered Test Quantities]

The BB tree is:

$T_{01} = -(0.357, 0.273)$    $P_{01} = (1,2)$

$T_{11} = -(0.357, 0.286)$    $P_{11} = (1,1)$

Thus, we get policy (1,1,2), the same one we entered with. Hence,
our optimum policy $P_0$ and g so far are

$P_0 = (1,1,2)$    $g^* = 9.34$


We add the constraint

$$ d_1^1 + d_2^1 \le 0 \tag{3.47} $$

and we enter BB with no restrictions on $\Delta_i$, i.e.,

[Table: State / Ordered Test Quantities]

The BB tree is:

$T_{01} = -(0.357, 0.273)$    $P_{01} = (1,2)$

$T_{11} = -(0.374, 0.273)$    $P_{11} = (3,2)$

Hence, we have a policy (3,2,2) for which we perform PE:

State   Ordered Test Quantities

1       $t_1^2 = -0.091$    $t_1^1 = -0.096$    $t_1^3 = -0.099$
2       $t_2^2 = -0.656$    $t_2^1 = -0.756$
3       -0.128    -0.134    -0.147

A = (3,2,2)    g(A) = 12.40

Regular PI yields an infeasible policy. Thus, we maximize in 3 and
enter BB with:

[Table: State / Ordered Test Quantities]

The BB tree:

$T_{01} = -(0.091, 0.656)$    $P_{01} = (2,2)$

$T_{11} = -(0.099, 0.656)$    $P_{11} = (3,2)$

And we get (3,2,2), the policy we entered with. Since its g is
greater than $g^*$, we update:

$P_0 = (3,2,2)$    $g^* = 12.40$


We add the constraint

$$ d_1^3 + d_2^2 \le 0 \tag{3.48} $$

and we enter BB with:

[Table: State / Ordered Test Quantities]

The BB tree:

$T_{01} = -(0.091, 0.656)$    $P_{01} = (2,2)$

No feasible policy exists in the tree, i.e., the problem has become
infeasible. Thus, our optimum policy and g are

$P_0 = (3,2,2)$    g = 12.40

This is the same optimum policy as in Chapter II, the risk-indifferent
case. Thus, a small risk-aversion coefficient of 0.01 does not change
the optimum policy. The negligible effect of such a small coefficient
may also be detected by considering the risk premium, the difference
between expected value and expected utility decision making:

$g^* - \tilde g^* = 12.774 - 12.40 = 0.374$

This,  however,  is  not  our  concern  here.  Our  main  objective  is  to  have 
an  algorithm  which  works  for  both  risk-indifferent  and  risk-sensitive 
cases.  This  objective  has  been  achieved. 


Chapter  IV 

SENSITIVITY  OF  OPTIMAL  POLICY  TO  CONSTRAINTS 


In this chapter, we investigate the effect of policy constraints on
the optimal policy and how much a rational decision maker would be
willing to pay in order to remove one or more constraints.

Our point of departure is the absence of policy constraints. In
this case, no policy can yield a higher gain (or certain equivalent gain)
than the policy $P_0$ arrived at by Howard's VD-PI (or PE-PI) algorithm.
When policy constraints are introduced, policy $P_0$ might, or might not,
become infeasible. If the set of policy constraints does not make $P_0$
infeasible, it is still the optimal policy, and the constraints are
merely a red herring. In other words, we can tell the decision maker
that, in this case, his concern about policy constraints is much ado
about nothing. We define such an optimal policy as a
"constraint-indifferent" optimal policy, and all constraints are worthless,
in the sense that the decision maker has nothing to gain by removing any
of them. Moreover, we do not have to solve the problem in the absence of
policy constraints to detect constraint-indifference. The algorithm we
developed has the ability to detect that, as we showed in the convergence
proofs. If we exit the algorithm from E2, we have a constraint-indifferent
optimal policy. The distinguishing feature of the E2 exit is that,
for the selected policy, each alternative maximizes the test quantity
in its state. Therefore, any other policy results in nonpositive $\gamma_i$,
whence the gain cannot increase.

If we exit the algorithm from E3, we have what we define as a
"constraint-sensitive" optimal policy. Here, the policy having the highest
possible gain is infeasible. The constraints have affected the optimal
policy. If we look at the table of ordered test quantities for the last
iteration, there will be at least one state in which the selected
alternative does not maximize the test quantity (otherwise, we would have
exited from E2). For such states, the only thing that prevented BB from
maximizing the test quantities is feasibility. Thus, it might be
worthwhile to pay for removing some constraints. There is, however, a
subset of constraints (possibly the null set) which do not affect the
optimal policy and can be determined from the final (i.e., last
iteration) table of ordered test quantities. Assume that the coupled states
are numbered 1,2,...,M ≤ N, and that the alternatives selected by the
optimal policy in those states are a, b, ..., c respectively.

We start by listing the table of ordered test quantities for the
coupled states of the optimal policy. Without loss of generality, we
have renumbered the alternatives in each state according to the
descending order of their test quantities.

The line we have drawn through the table is the line representing the
restriction $\gamma_i \ge 0$. No alternatives to the right of this line are
considered in BB. Our claim is that, if the $d_i^k$ involved in a
constraint are all to the right of that line, the optimal policy is not
affected by that constraint. This is due to the fact that only $\gamma_i \ge 0$
can yield a better policy, and all such policies (to the left of the
line) are infeasible (otherwise, we would have converged to the best).

Finally, the alternatives selected by the optimal policy in the
coupled states are given by

P(1) = a,  P(2) = b,  ...,  P(M) = c

Each policy constraint $C_p(d_i^k) \le 0$ (or $= 0$) involves $d_i^k$ such
that $i \in \{1,2,\ldots,M\}$. Now consider the subset of constraints

$$ C = \{\, C_p(d_i^k) : k > P(i) \text{ for all } i \,\} $$

Then C is composed of constraints which do not affect the optimal
policy, i.e., if they are discarded, the solution would not change.
The proof is simple. If a constraint involves $d_i^k$ such that k > P(i)
for all i, then the extra feasible policies resulting from discarding
that constraint have negative $\gamma_i$'s (to the right of the line), whence
their gain is inferior to the policy we already obtained. Consequently,
the algorithm would not converge to any of them. It would still
converge to the same optimal policy.
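The test for membership in the subset C reduces to a comparison of ranks. A minimal sketch follows, assuming each constraint is stored as the list of (state, rank) pairs of the alternatives it involves, with alternatives renumbered in descending order of test quantity as described above; the constraint names and data are hypothetical.

```python
def worthless_constraints(constraints, selected_rank):
    """Return names of constraints whose alternatives all lie to the right
    of the line, i.e. rank k > P(i) for every (i, k) in the constraint.

    constraints:   dict name -> list of (state, rank) pairs
    selected_rank: dict state -> rank P(i) of the selected alternative
    """
    return [name for name, pairs in constraints.items()
            if all(k > selected_rank[i] for i, k in pairs)]


# Hypothetical coupled states 1 and 2 with selected ranks P(1) = 2, P(2) = 1.
selected_rank = {1: 2, 2: 1}
constraints = {
    "c1": [(1, 3), (2, 2)],   # both alternatives rank below the selected ones
    "c2": [(1, 1), (2, 2)],   # involves the best alternative in state 1
}
print(worthless_constraints(constraints, selected_rank))
```

Only the first constraint passes the test; the second touches an alternative to the left of the line and so may have a positive worth.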


Thus, when the algorithm terminates, we can immediately determine
whether or not there are worthless constraints. If we exit from E2, we
have a constraint-indifferent optimal policy. If we exit from E3, we
can look at the table of ordered test quantities and detect those
constraints that the decision maker need not have concerned himself with.
At this point, he should not be willing to pay for removing any of them;
he would not gain anything. As for the remainder of the constraints,
some of them might also be worthless, but some of them definitely are
not. To discover the worth of any single constraint, we can remove it
and solve the problem again, starting with the optimal policy as our
initial feasible policy. This we do for the taxicab example in the
risk-indifferent case. There, our policy constraints were

$$ d_1^1 - d_2^1 = 0 \tag{4.1} $$

$$ d_1^2 + d_2^2 \le 1 \tag{4.2} $$

We converged to policy $P_0 = (3,2,2)$ and gain g = 12.77.

[Table: State / Ordered Test Quantities]

$P_0 = (3,2,2)$    g = 12.77

Here, we have drawn the $\gamma_i \ge 0$ line and the constraints. Neither
constraint involves $d_i^k$ which are all to the right of that line. Thus,
we do not have, as yet, any worthless constraints. Let us see what
happens if we discard (4.1). This means that the union drops the
restriction of using its facilities but still allows only one taxi-cab
stand to be used. Our initial feasible solution is (3,2,2). We note
that no improvement can be made in the free state, and maximizing over
the coupled ones results in the infeasible policy (2,2,2). Hence, we
enter BB with the next table. (We shuffled the states around to have
the one with least alternatives as the first component. This reduces
the BB computations.)


[Table: State / Ordered Test Quantities]

The BB tree is:

$T_{01} = (1.04, 20.69)$    $P_{01} = (2,2)$

$T_{11} = (0.58, 20.69)$    $P_{11} = (1,2)$

Hence, we move to (1,2,2), a feasible policy different from the
one we entered with. Performing VD gives us

State   Ordered Test Quantities

1       1.47     1.12     0.59
2       20.48    11.08
3       1.2      0.84     0.22

P = (1,2,2)    g = 13.15


No improvement is possible via maximizing test quantities, so we enter
BB with

[Table: State / Ordered Test Quantities]

The BB tree is identical to the previous one, i.e., we converge to
(1,2,2). Therefore, we introduce the constraint

$$ d_1^1 + d_2^2 \le 0 \tag{4.3} $$

and we set our optimal policy so far as

$P_0 = (1,2,2)$    $g^* = 13.15$


[Table: State / Ordered Test Quantities]

The BB tree is:

$T_{01} = (2.44, 8.87)$    $P_{01} = (1,2)$

$T_{11} = (2.05, 6.61)$    $P_{11} = (2,1)$

And we have converged to (2,1,2). Since $g < g^*$, our previous optimal
policy is still optimal so far. We introduce the constraint

$$ d_1^2 + d_2^1 \le 0 \tag{4.4} $$

and enter BB with

[Table: State / Ordered Test Quantities]

The BB tree is:

$T_{01} = (2.44, 8.87)$    $P_{01} = (1,2)$

The only node in the tree is infeasible, and no branching is
possible, i.e., an infeasible problem. Thus, we have exhausted the set of
feasible policies and exited from E3. Our necessarily
constraint-sensitive optimal policy is

P = (1,2,2)    g = 13.15

The difference in gains between this optimal policy and that in the
presence of (4.1) is

$\Delta g_1 = 13.15 - 12.77 = 0.38$

Therefore, it is not worth more than 0.38 units per transition for
the taxi-cab driver to try removing (4.1). For instance, if there were
a proposition in the union for raising the dues in return for abolishing
the rule that forces a member to use union facilities, he should not vote
for it if the increase in dues averages more than 0.38 per trip.
Otherwise, he votes for the proposition. He stands to gain by changing his
optimal policy under the new rules of the game. Note that there still
is a more gainful policy if (4.2) were also discarded (the
constraint-indifferent policy; here, it would be constraint-indifferent by
default because there would be no constraints). We know about the existence
of that policy from the fact that we did not exit from E2.

We could discard (4.2) and start with the policy (1,2,2) to get
the optimal one, thus computing how much that constraint is worth in the
absence of (4.1). A more interesting thing happens, however, if we
retain (4.1) and discard (4.2). In other words, if, after solving the
problem in the presence of both (4.1) and (4.2), we want to know how much
(4.2) is worth, we would start with policy (3,2,2) and only (4.1) as a
constraint. We repeat the policy's table here.


State   Ordered Test Quantities

1       1.04     0.58
2       20.69    9.09
3       1.51     0.95

$P_0 = (3,2,2)$    g = 12.77

Here, no improvement can be made in the free state, but we can maximize
over the coupled states to get:

State   Ordered Test Quantities

1       0.82     0.71     0.37
2       22.29    13.21
3       1.01     0.75     0.33

P = (2,2,2)    g = 13.34

Here, the test quantities are all maximum in their states under this
policy, and we exit from E2. Hence, (2,2,2) is the best policy he can
ever implement. It is also constraint-indifferent. In other words, it
is only the rule that only one union taxicab stand can be used that need
concern him. Its value is

$\Delta g_2 = 13.34 - 12.77 = 0.57$

Note, however, that, in the absence of constraint (4.1), constraint
(4.2) is worth

$\Delta g = 13.34 - 13.15 = 0.19$
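The three differences quoted for the taxicab example follow from four gains: with both constraints, with each one alone, and with none. The sketch below tabulates them; the gains are those reported in the text (the unconstrained gain is taken as 13.34 because (2,2,2) was reached through the E2 exit), while the dictionary layout and the function name are merely illustrative.

```python
# Optimal gain for each set of constraints that is present.
gains = {
    frozenset({"4.1", "4.2"}): 12.77,
    frozenset({"4.2"}): 13.15,        # (4.1) discarded
    frozenset({"4.1"}): 13.34,        # (4.2) discarded
    frozenset(): 13.34,               # no constraints
}


def worth(removed, present):
    """Gain obtained by removing one constraint from the set `present`."""
    present = frozenset(present)
    return gains[present - {removed}] - gains[present]


both = {"4.1", "4.2"}
w_41 = worth("4.1", both)                 # 13.15 - 12.77
w_42_after = worth("4.2", {"4.2"})        # 13.34 - 13.15
w_42 = worth("4.2", both)                 # 13.34 - 12.77
w_41_after = worth("4.1", {"4.1"})        # 13.34 - 13.34

print(round(w_41, 2), round(w_42_after, 2), round(w_42, 2), round(w_41_after, 2))
```

Either order of removal accumulates the same total, and the worth of (4.1) drops to zero once (4.2) is gone.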


This is arrived at by the fact that, in the absence of (4.1), we
converged to a gain of 13.15. Had we then discarded (4.2) and started
with the then optimal policy (1,2,2), we would have converged to
(2,2,2). The values naturally turn out to be additive. (Otherwise, we
would have a "money pump" situation.) In other words, a constraint does
not usually have a value independent of other constraints. There could
be, however, a constraint (or group of constraints) that renders the
others worthless. In our example, constraint (4.2) was such a constraint.
If that constraint, representing the one-stand-only rule, were removed,
the other constraint would be worthless. To achieve that, a value of 0.57
per transition is an upper bound. If only (4.1) were removed at a cost of
0.38 per transition, then removing (4.2) would be worth 0.19. Thus, to
get the constraint-indifferent optimal policy, our friend could do one
of two things. He could expend up to 0.38 per transition to remove the
use-the-union-facilities rule and up to 0.19 per transition to remove
the one-stand-only rule, for a total of 0.57 per transition.
Alternatively, he could expend up to 0.57 per transition to remove the
one-stand-only rule and not bother about the first rule. In either case,
there is an (identical) upper bound on how much he should be willing to
pay to achieve that constraint-indifferent optimal policy.

In general, it is not that simple to discover combinations that
achieve the constraint-indifferent optimal policy. The fact is,
however, that no matter how it can be achieved, there is a unique upper
bound on the value of doing so, namely the difference in gain between
the constraint-indifferent optimal policy and the constraint-sensitive
optimal policy given by the algorithm when all constraints are taken
into consideration. If the decision maker is interested in the
aforementioned upper bound, we could start discarding constraints, one at a
time, and solving the problem starting with the latest
constraint-sensitive optimal policy as an initial feasible policy until we
eventually get the constraint-indifferent optimal policy. We can give him
two things here. First, the breakdown of the upper bound between
constraints. Secondly, we tell him that, no matter what, the
constraint-indifferent optimal policy is the best he can ever hope to
achieve for the given problem.

If the decision maker is interested only in specific constraints,
or groups thereof, we can compute the worth of such constraints by the
aforementioned technique, it being understood that the worth we compute
of specific constraints is subject to the presence of the remainder of
the constraints.

The problem of determining specifically which constraint, or group
of constraints, renders the rest worthless (i.e., discarding them results
in a constraint-indifferent optimal policy), or even the minimal such
group, is an essentially combinatorial problem worthy of research.

Chapter V


MODIFICATIONS  FOR  PROBLEMS  WITH  A LARGE  NUMBER  OF  STATES 

A. Introduction

As explained earlier, for a constraint-sensitive optimal, we have
to exhaust the feasible policy set. That set becomes very large as the
number of states increases. At the same time, the rate at which it is
exhausted is slow. This we discuss in further detail in Section B. In
Section C, we introduce the idea of partitions, i.e., dividing the
coupled states into groups which have no intergroup couplings. We show how
partitioning increases the rate of exhausting the feasible policy set,
discuss the problems introduced by partitioning, and how to overcome
them. In Section D, we elaborate on how to deal with transient states
if we have a single trapping state (whence all policies have the same
gain). In Section E, we discuss a modification which might further
speed up the algorithm. Finally, in Section F, we introduce various
constraints to Howard's baseball problem [4] and give the computational
results.

B. Of Dimensions and Exhaustion

As the "dimension" of the problem, i.e., the number of states,
increases, the number of policies increases multiplicatively. This is
because the latter is given by $\prod_{i=1}^{N} K_i$, where $K_i$ is the number of
alternatives in state i. Thus, adding one state with 3
alternatives, say, increases the number of policies by a multiplicative
factor of 3. While it is true that the constraints eliminate some of
these policies, we can still consider that, in general, we have a fairly
large number of feasible policies to consider. In the case of a
constraint-sensitive optimal policy, we have to exhaust the feasible policy
set (exit E3 in the algorithm). As we have seen, whenever branch and
bound converges on a policy, that policy maximizes the gain over the
subset of policies differing from it in exactly one coupled state. That
subset is discarded, and exhaustion occurs when the union of such
subsets covers the feasible policy set. Now, if we consider the number of
policies in any one of the aforementioned subsets, we find that it is
given by $\sum_{i=1}^{N} K_i - N + 1$. (Here, we assume, for illustrative purposes,
that all the N states are coupled.) Of this number of policies, some
might already be infeasible, either due to the original constraints or
type (2.71) constraints. Hence, the exhaustion process is, at best,
additive in nature, whereas the set to be exhausted has a size whose
nature is multiplicative. In other words, the feasible set is being
exhausted at too slow a rate. It thus appears that, in the absence of
additional structural properties, the computational cost of the
algorithm would be prohibitive for problems with a large number of states.
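The gap between the multiplicative size of the policy set and the additive size of a discarded subset can be made concrete. A minimal sketch, assuming all states are coupled and using the alternative counts 3, 2, 3 of the three-state example purely as an illustration:

```python
from math import prod


def total_policies(K):
    """Number of policies: the product of the alternative counts K_i."""
    return prod(K)


def discarded_subset_size(K):
    """Policies differing from a given policy in at most one state:
    sum of K_i, minus N, plus 1."""
    return sum(K) - len(K) + 1


K = [3, 2, 3]
K_plus = K + [3]          # one more state with 3 alternatives

print(total_policies(K), discarded_subset_size(K))
print(total_policies(K_plus), discarded_subset_size(K_plus))
```

Adding the extra state triples the number of policies but enlarges the discarded subset by only two.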

C. Of Partitions

For practical problems with a large number of states, it might very
well turn out that the coupled states can be divided into groups having
no inter-group coupling of alternatives. (We will later give examples.)
Those groups we term "partitions." We thus define a partition as a group
of coupled states in which no alternative in any state appears in a
constraint involving alternatives in any state not in the partition.
Formally: Let the mth constraint ($C_m$) be defined by the set of ordered
pairs (i,k) referring to the $d_i^k$'s involved in the constraint, i.e.,

$$ C_m = \{\, (i_1(m),k_1(m)), \ldots, (i_J(m),k_J(m)) \,\} $$

e.g., for $d_1^2 + d_3^1 + d_5^2 > 1$,

$$ C = \{(1,2), (3,1), (5,2)\} $$

Let the rth partition be defined by its constituent states:

$$ P_r = \{\, i_1(r), \ldots, i_{N_r}(r) \,\} $$

Then our definition becomes

$$ i \in P_r \iff \{\, (i,k) \in C_m \Rightarrow i_j(m) \in P_r \;\; \forall\, (i_j(m),k_j(m)) \in C_m \,\} $$
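One way to obtain groups satisfying this definition is to treat every constraint as linking the states it mentions and to take the connected components. The sketch below does so with a small union-find; it is an illustration under that assumption, not the procedure used for the computational results, and the second constraint in the example is hypothetical.

```python
def find_partitions(constraints):
    """Group coupled states so that no constraint crosses two groups.

    constraints: list of sets of (state, alternative) pairs, as in C_m.
    Returns a list of sets of states, ordered by smallest state.
    """
    parent = {}

    def find(s):
        parent.setdefault(s, s)
        while parent[s] != s:
            parent[s] = parent[parent[s]]   # path halving
            s = parent[s]
        return s

    for c in constraints:
        states = sorted({i for i, _ in c})
        root = find(states[0])
        for s in states[1:]:
            parent[find(s)] = root          # merge into the first state's group

    groups = {}
    for s in list(parent):
        groups.setdefault(find(s), set()).add(s)
    return sorted(groups.values(), key=min)


constraints = [
    {(1, 2), (3, 1), (5, 2)},   # the constraint used as an example in the text
    {(2, 1), (4, 3)},           # hypothetical second constraint
]
print(find_partitions(constraints))
```

States 1, 3 and 5 end up in one group and states 2 and 4 in another, since no constraint mentions states from both.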

We have thus characterized the states further. First, we divide them
into free and coupled states. Then the coupled states are further
decomposed into partitions (whence a partition may be regarded as a
"generalized free state"). The advantage of partitioning is explained by
the following proposition.

Proposition 5.1.

Let the maximization of the Lagrangian over the set of feasible
policies subject to $\gamma_i \ge 0$ yield a policy P for which all $\gamma_i = 0$
(i.e., branch and bound converged to P). Then P maximizes the gain
over the set of all feasible policies that differ from P in at most
one state in each partition.

Proof.

Let P' be a feasible policy differing from P in at most one
state in each partition. First, recall that

$$ g(P') - g(P) = \Delta g = \sum_{i=1}^{N} \pi_i(P')\,\gamma_i \tag{2.66} $$

Now consider each partition. If P' is identical with P throughout
a partition, then $\gamma_i = 0$ for all states in that partition. If P'
differs from P in exactly one state in a partition, then $\gamma_i \le 0$ for
that state. (Otherwise, the Lagrangian could have been improved, and
that did not happen.) As for the free states, $\gamma_i = 0$. (The
alternatives there maximize the test quantities.) Therefore, $\gamma_i \le 0$ for all
states, whence $\Delta g \le 0$ and P cannot be inferior, in gain, to P'.

The importance of the previous result is that, with partitions, the
rate of exhausting the feasible set is greatly enhanced. The number of
policies in a subset discarded by a policy P has a rough estimate of

$$ \prod_{r=1}^{R} \Big[ \sum_{i \in P_r} K_i - N_r + 1 \Big] $$

where R is the number of partitions and $N_r$ is the number of states in
partition $P_r$. Actually, the previous estimate is on the high side
because we have to subtract from it those combinations that cause a
change in more than one partition. However, that estimate serves to
illustrate the fact that the rate of exhaustion is better than additive,
a definite improvement on the case where no partitions exist. This
contention has been borne out in the computational results.

The introduction of partitions complicates matters slightly. First,
we have the problem of additional (2.71) type constraints, to implement
discarding subsets over which we maximize the gain. We could add a
constraint for each partition. In this case, however, we would have to
consider combinations of them to detect whether or not a given policy
belongs to a discarded set. This problem is resolved by applying the
result of the proposition directly without resort to formal constraints
of type (2.71). Every time branch and bound converges to a policy, that
policy is retained in lieu of a constraint. Then, given any policy, it
is compared to the retained policies. If the given policy differs from
a retained policy by, at most, one alternative in each partition, that
given policy belongs to a set over which we have maximized. It is
ignored, i.e., considered infeasible.
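The comparison against retained policies is a simple count of differences per partition. A minimal sketch, assuming policies are stored as mappings from state to selected alternative; the function names and the five-state data are hypothetical.

```python
def within_discarded_set(policy, retained, partitions):
    """True if `policy` differs from `retained` in at most one state
    of every partition."""
    return all(sum(policy[s] != retained[s] for s in part) <= 1
               for part in partitions)


def is_discarded(policy, retained_policies, partitions):
    """True if `policy` belongs to a set already maximized over."""
    return any(within_discarded_set(policy, r, partitions)
               for r in retained_policies)


partitions = [{1, 3, 5}, {2, 4}]
retained = [{1: 1, 2: 1, 3: 1, 4: 1, 5: 1}]

one_change_each = {1: 2, 2: 3, 3: 1, 4: 1, 5: 1}   # one change per partition
two_in_first = {1: 2, 2: 1, 3: 2, 4: 1, 5: 1}      # two changes in {1, 3, 5}

print(is_discarded(one_change_each, retained, partitions))
print(is_discarded(two_in_first, retained, partitions))
```

The first policy is ignored as already covered; the second is not covered by the single retained policy and must still be examined.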

The second problem that arises is how to detect that the set of
feasible policies has been exhausted. Whenever BB converges to a
necessarily feasible policy P, we enter BB with no restriction on $\gamma_i$
for each partition in turn, all other partitions being held constant
(i.e., no change in alternatives). If BB finds the problem infeasible
for one partition, this does not imply exhaustion. Only if all
partitions yield an infeasible problem is the feasible policy set exhausted.
We have, however, to prove this assertion.

Assume that for policy P, all partitions yield an infeasible
problem. Assume that there exists a feasible policy P' whose gain is
greater than the optimum obtained so far. This implies that P' does
not belong to any set over which we have maximized the gain. This, in
turn, implies that for each retained policy R, there exists at least
one partition in which R and P' differ in more than one state. This,
of course, applies to P. Now consider P'. There exists a partition
$P_i$ and a retained policy $R_1$ such that P' and $R_1$ differ in more
than one state in $P_i$. (Otherwise, P' would belong to a discarded
set.) Now consider a policy $L_1$ which is identical to $R_1$ in all
states except those in $P_i$ and identical to P' in $P_i$. $L_1$ differs
from $R_1$ in more than one state in $P_i$, whence it does not belong to
the set discarded by $R_1$. But BB did not yield us $L_1$.
Therefore, there must exist a retained policy $R_2$ defining a
discarded set to which $L_1$ belongs. In partition $P_i$, $L_1$ and $R_2$, and
therefore P' and $R_2$, differ in at most one state. Thus, there
exists a partition $P_j$ in which P' and $R_2$ differ in more than one
state. Now consider policy $L_2$ identical to $R_2$ in all partitions
except $P_j$ and identical to P' in $P_j$. Thus, $L_2$ and P' are
identical in partitions $P_i$ and $P_j$. BB did not yield this policy, whence it
belongs to a discarded set defined by a retained policy $R_3$. Continuing
in this manner, we finally form P' and find that it belongs to a
subset over which we have maximized, whence its gain cannot exceed that
already obtained. Thus, we have exhausted the feasible policy set.

D . Of  Trapping  States 

In some problems, such as Howard's baseball example, there is one trapping state, whence all others are transient. In this case, all policies have the same gain. Howard shows, however, that his VD-PI algorithm improves the relative values at every iteration [4]. Since we have retained Howard's sufficient condition $\gamma > 0$, this still applies.

However, when we have a constraint-sensitive optimal, we are maximizing over subsets. Whenever we detect such a maximum, we compare its gain with the current optimum. If the gains are equal, as they are here, we need a criterion for selecting an optimum. We adopt the criterion developed by Nesbitt [10] for such cases, where $\sum_{i=1}^{N-1} \beta_i v_i$ is optimized, the $\beta_i$ being the given initial state probabilities. This maximizes the asymptotic expected reward.
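The tie-breaking rule can be illustrated with a short Python sketch. It assumes each candidate is a triple of a label, a gain, and a list of relative values, with the trapping state's value listed last as zero; the function names and the numbers are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

# (label, gain, relative values v_i); an assumed representation.
Candidate = Tuple[str, float, Sequence[float]]


def weighted_value(beta: Sequence[float], v: Sequence[float]) -> float:
    """Sum of beta_i * v_i over the states, beta being initial state probabilities."""
    return sum(b * x for b, x in zip(beta, v))


def select_optimum(candidates: List[Candidate], beta: Sequence[float],
                   tol: float = 1e-9) -> Candidate:
    """Pick the highest gain; among equal gains, pick the largest weighted value."""
    best_gain = max(gain for _, gain, _ in candidates)
    tied = [c for c in candidates if best_gain - c[1] <= tol]
    return max(tied, key=lambda c: weighted_value(beta, c[2]))


# Two hypothetical policies with equal (zero) gain, as in a trapping-state problem.
candidates = [("A", 0.0, [0.40, 0.90, 0.0]), ("B", 0.0, [0.70, 0.20, 0.0])]
start_in_state_1 = [1.0, 0.0, 0.0]
chosen = select_optimum(candidates, start_in_state_1)
```

With all probability on state 1 the sketch selects policy B; with the initial probability split evenly over the two transient states it would select A, which shows that the choice depends on the initial distribution when gains are tied.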

E. Of Speeding Up the Algorithm

As mentioned earlier, a computationally burdensome approach to the constrained Markov decision process problem would be to obtain the unconstrained optimum by Howard's VD-PI algorithm, then proceed backwards in the policy ordering, checking feasibility. Our algorithm attacks the problem directly by considering only feasible policies and exhausting the feasible policy set. We can also obtain a constraint-indifferent optimum without having to exhaust the set. A middle-ground approach would be to obtain the unconstrained optimum by VD-PI, then start our algorithm from there. The initial feasible policy is obtained by branch and bound with no restrictions on $\gamma$, from the table of test quantities rather than from the $q_i$. While we still would have to exhaust the feasible policy set for a constraint-sensitive optimum, we have a better chance of starting at a policy having a high gain. While it is true that we perform extra VD's in the beginning, it is hoped that this will be offset by performing fewer BB iterations later on.

F.  Baseball  Example  and  Computational  Results 

All the above considerations were applied to Howard's baseball example [4] with constraints imposed on the policies, the constraints being various bright ideas imposed on the manager by the club's eccentric owner.


Table 5.1 gives the results for a simple constraint:

d2 + d3 < 1        (5.1)

This yields a constraint-indifferent optimal policy. Another simple constraint yields the results given in Table 5.2:

d2 + d3 - 1        (5.2)

In both cases, we have only one partition.

Encouraged by the results his team achieves, the owner wants to try new ideas. The manager knows better but wants to keep his job, so he compromises. They agree that, irrespective of how many men are out, the owner's ideas are to be carried out only if there is a man on third base and the bases are not loaded. With no one out, the owner wants a hit decision in at least 2 of the 3 possible situations. Likewise with one man out and with two men out. This translates immediately to the following constraints:

$$d_5^1 + d_6^1 + d_7^1 \ge 2 \tag{5.3}$$

$$d_{13}^1 + d_{14}^1 + d_{15}^1 \ge 2 \tag{5.4}$$

$$d_{21}^1 + d_{22}^1 + d_{23}^1 \ge 2 \tag{5.5}$$
Table 5.1

BASEBALL PROBLEM
(Transient, One Constraint)

Number of policy constraints = 1
Number of feasible policies  = 2.2 x 10^10
Number of iterations         = 2
Optimal gain                 = 0.0
Optimal policy type          = constraint-indifferent

State i   Decision     Value v_i
   1      Hit           0.81
   2      Hit           1.24
   3      Hit           1.32
   4      Hit           1.88
   5      Hit           1.56
   6      Hit           2.07
   7      Hit           2.16
   8      Hit           2.73
   9      Hit           0.45
  10      Hit           0.77
  11      Hit           0.87
  12      Hit           1.23
  13      Hit           1.10
  14      Hit           1.44
  15      Hit           1.53
  16      Hit           1.95
  17      Hit           0.17
  18      Hit           0.33
  19      Hit           0.39
  20      Hit           0.58
  21      Hit           0.50
  22      Hit           0.67
  23      Hit           0.73
  24      Hit           0.98
  25      Trapped       0.0
That is not the whole story, however. The owner has additional bright ideas. If the manager decides to hit with a man on first base, then he has to hit with a man on second. The corresponding constraints are:

$$d_6^1 + d_7^2 \le 1 \tag{5.6}$$

$$d_6^1 + d_7^3 \le 1 \tag{5.7}$$

$$d_{14}^1 + d_{15}^2 \le 1 \tag{5.8}$$

$$d_{14}^1 + d_{15}^3 \le 1 \tag{5.9}$$

$$d_{22}^1 + d_{23}^2 \le 1 \tag{5.10}$$

$$d_{22}^1 + d_{23}^3 \le 1 \tag{5.11}$$


Finally, the manager cannot decide to hit with a man on second base unless he decides the same with one on first and second. This translates into:

$$d_5^2 + d_7^1 \le 1 \tag{5.12}$$

$$d_5^3 + d_7^1 \le 1 \tag{5.13}$$

$$d_{13}^2 + d_{15}^1 \le 1 \tag{5.14}$$

$$d_{13}^3 + d_{15}^1 \le 1 \tag{5.15}$$

$$d_{21}^2 + d_{23}^1 \le 1 \tag{5.16}$$

$$d_{21}^3 + d_{23}^1 \le 1 \tag{5.17}$$
Constraints (5.3) through (5.17) decompose into three natural partitions corresponding to how many men are out. Table 5.3 gives the results under these conditions. Note that the optimal is constraint-indifferent. In other words, all the fuss the owner made is really irrelevant. The manager does exactly what he would have done without any interference from the owner. However, neither of them realizes that, and the team continues to win, which makes the owner come up with even more ideas. The manager salvages freedom of action only if the bases are loaded or nobody is on. In addition to the previous constraints, the owner imposes restrictions whenever nobody is on third. He wants the decision to be a steal in at least two of every three possible situations. This leads to:

$$d_2^3 + d_3^3 + d_4^3 \ge 2 \tag{5.18}$$

$$d_{10}^3 + d_{11}^3 + d_{12}^3 \ge 2 \tag{5.19}$$

$$d_{18}^3 + d_{19}^3 + d_{20}^3 \ge 2 \tag{5.20}$$


Moreover, if he decides to steal second with one man on, he cannot hit or bunt (i.e., must steal third) with two men on:

$$d_2^3 + d_4^1 \le 1 \tag{5.21}$$

$$d_2^3 + d_4^2 \le 1 \tag{5.22}$$

$$d_{10}^3 + d_{12}^1 \le 1 \tag{5.23}$$

$$d_{10}^3 + d_{12}^2 \le 1 \tag{5.24}$$

$$d_{18}^3 + d_{20}^1 \le 1 \tag{5.25}$$

$$d_{18}^3 + d_{20}^2 \le 1 \tag{5.26}$$


With fewer than two men out, he cannot decide to steal with two men on if he decides to hit or bunt with a man on second:

$$d_3^1 + d_4^3 \le 1 \tag{5.27}$$

$$d_3^2 + d_4^3 \le 1 \tag{5.28}$$

$$d_{11}^1 + d_{12}^3 \le 1 \tag{5.29}$$

$$d_{11}^2 + d_{12}^3 \le 1 \tag{5.30}$$


With two men out, the rule changes: he cannot steal with two men on if he decides to hit or bunt with a man on first:

$$d_{18}^1 + d_{20}^3 \le 1 \tag{5.31}$$

$$d_{18}^2 + d_{20}^3 \le 1 \tag{5.32}$$

Table 5.4 gives the results of the problem subject to constraints (5.3) through (5.32).
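As a cross-check, one reading of constraints (5.3) through (5.32) can be encoded as simple counting inequalities and tested against a policy. The sketch below rests on several assumptions that are not stated explicitly in this section: the alternative indices (1 = hit, 2 = bunt, 3 = steal), the state layout (eight base configurations per out-count, with states 2 to 4 having nobody on third and states 5 to 7 having a man on third), and the policy itself, which is the decision column of the thirty-constraint tables in this section. The function names are hypothetical.

```python
HIT, BUNT, STEAL = 1, 2, 3  # assumed alternative indices k in d_i^k


def build_constraints():
    """Return (terms, sense, rhs) triples; terms are (state, alternative) pairs."""
    cons = []
    for base in (0, 8, 16):  # 0, 1 and 2 men out
        first, second, both = base + 2, base + 3, base + 4
        third, first_third, second_third = base + 5, base + 6, base + 7
        # Man on third, bases not loaded: hit in at least two of three states.
        cons.append(([(third, HIT), (first_third, HIT), (second_third, HIT)], ">=", 2))
        for k in (BUNT, STEAL):
            cons.append(([(first_third, HIT), (second_third, k)], "<=", 1))
            cons.append(([(third, k), (second_third, HIT)], "<=", 1))
        # Nobody on third: steal in at least two of three states.
        cons.append(([(first, STEAL), (second, STEAL), (both, STEAL)], ">=", 2))
        for k in (HIT, BUNT):
            cons.append(([(first, STEAL), (both, k)], "<=", 1))
            trigger = second if base < 16 else first  # rule changes with two out
            cons.append(([(trigger, k), (both, STEAL)], "<=", 1))
    return cons


def violations(policy, constraints):
    """List the constraints that the given state -> alternative policy violates."""
    bad = []
    for terms, sense, rhs in constraints:
        total = sum(1 for state, alt in terms if policy[state] == alt)
        ok = total >= rhs if sense == ">=" else total <= rhs
        if not ok:
            bad.append((terms, sense, rhs))
    return bad


constraints = build_constraints()
all_hit = {s: HIT for s in range(1, 25)}
table_policy = dict(all_hit)
for s in (3, 4, 11, 12, 18, 20):  # states listed with a steal decision
    table_policy[s] = STEAL
```

Under these assumptions the tabulated policy satisfies all thirty constraints, while the all-hit policy fails only the three "steal in at least two of three" constraints, which is consistent with the optimum being constraint-sensitive.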

In all the previous cases, we retained the original structure of the problem, namely a single trapping state (state 25, three men out). To select among policies, we used an initial state probability distribution $\beta_1 = 1$, $\beta_i = 0$ for $i \ne 1$. This means always starting in state 1 (no men out, no men on). The same problems were run with $\beta_i = 1/24$ (equally likely to start anywhere), and the results were identical. To test the algorithm on a problem with many recurrent states, we changed the transition probabilities of the trapping state. We made state 25 return to state 1 (i.e., a new inning) with probability 1. Tables 5.5, 5.6, and 5.7 give the results of the recurrent problems subject to constraints (5.2), (5.3) through (5.17), and (5.3) through (5.32), respectively. Finally, we ran the algorithm in the manner described in Section E for both the recurrent and transient baseball problems, and a slight improvement in execution time was noticed. Tables 5.8 and 5.9 give the results of the two problems, respectively.
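The effect of redirecting the trapping state can be seen on a toy chain. The following Python sketch is purely illustrative: the two-state chain, its reward vector, and the helper names are assumptions, and the stationary distribution is approximated by plain power iteration rather than by the value-determination step used in the algorithm.

```python
def stationary(P, iters=500):
    """Approximate the limiting state distribution by repeated multiplication."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def gain(P, q):
    """Average reward per transition: limiting probabilities times rewards."""
    return sum(p * r for p, r in zip(stationary(P), q))


def restart_from_trap(P, trap, start):
    """Copy P and make the trapping state jump to `start` with probability 1."""
    Q = [row[:] for row in P]
    Q[trap] = [0.0] * len(P)
    Q[trap][start] = 1.0
    return Q


# Hypothetical chain: state 0 earns 1 per step and falls into the trap (state 1)
# with probability 0.5; the trap earns nothing.
P_transient = [[0.5, 0.5], [0.0, 1.0]]
q = [1.0, 0.0]
P_recurrent = restart_from_trap(P_transient, trap=1, start=0)

g_transient = gain(P_transient, q)
g_recurrent = gain(P_recurrent, q)
```

In the transient version the gain is zero regardless of the decisions made before trapping, whereas the restarted chain has a positive gain (2/3 for these numbers), so policies can again be distinguished by gain.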

The  computational  results  obtained  indicate  to  us  that  we  have  a 
computationally  efficient  algorithm  for  Markov  Decision  Processes  with 
constraints  when  the  number  of  states  is  large. 


Table 5.5

BASEBALL PROBLEM
(Recurrent, One Constraint)

Number of policy constraints = 1
Number of feasible policies  = 2.2 x 10^
Number of iterations         = 4
Optimal gain                 = 0.1373
Optimal policy type          = constraint-sensitive

State i   Decision     Value v_i
   1      Hit           0.13
   2      Hit           0.62
   3      Bunt          0.45
   4      Hit           1.26
   5      Hit           0.91
   6      Hit           1.44
   7      Hit           1.52
   8      Hit           2.11
   9      Hit           0.01
  10      Hit           0.35
  11      Hit           0.43
  12      Hit           0.81
  13      Hit           0.68
  14      Hit           1.02
  15      Hit           1.11
  16      Hit           1.53
  17      Hit          -0.05
  18      Hit           0.11
  19      Hit           0.17
  20      Hit           0.36
  21      Hit           0.27
  22      Hit           0.45
  23      Hit           0.51
  24      Hit           0.76
  25      New Inning    0.0

Table 5.6

BASEBALL PROBLEM
(Recurrent, Fifteen Constraints)

Number of policy constraints = 15
Number of feasible policies  = 3.4 x 10^7
Number of iterations         = 2
Optimal gain                 = 0.1406
Optimal policy type          = constraint-indifferent

State i   Decision     Value v_i
   1      Hit           0.14
   2      Hit           0.61
   3      Hit           0.68
   4      Hit           1.24
   5      Hit           0.91
   6      Hit           1.43
   7      Hit           1.52
   8      Hit           2.09
   9      Hit           0.001
  10      Hit           0.34
  11      Hit           0.42
  12      Hit           0.80
  13      Hit           0.67
  14      Hit           1.01
  15      Hit           1.10
  16      Hit           1.32
  17      Hit          -0.06
  18      Hit           0.10
  19      Hit           0.16
  20      Hit           0.35
  21      Hit           0.27
  22      Hit           0.44
  23      Hit           0.50
  24      Hit           0.75
  25      New Inning    0.0

Table 5.7

BASEBALL PROBLEM
(Recurrent, Thirty Constraints)

Number of policy constraints = 30
Number of feasible policies  = 4.6 x 10^4
Number of iterations         = 2
Optimal gain                 = 0.1017
Optimal policy type          = constraint-sensitive

State i   Decision     Value v_i
   1      Hit           0.10
   2      Hit           0.42
   3      Steal 3       0.31
   4      Steal 3       0.64
   5      Hit           0.89
   6      Hit           1.33
   7      Hit           1.52
   8      Hit           2.11
   9      Hit          -0.004
  10      Hit           0.21
  11      Steal 3       0.18
  12      Steal 3       0.34
  13      Hit           0.68
  14      Hit           0.98
  15      Hit           1.14
  16      Hit           1.57
  17      Hit          -0.05
  18      Steal 2      -0.03
  19      Hit           0.18
  20      Steal 3       0.10
  21      Hit           0.29
  22      Hit           0.45
  23      Hit           0.55
  24      Hit           0.81
  25      New Inning    0.0



Table 5.8

BASEBALL PROBLEM
(Recurrent, Thirty Constraints, Started as in Section E)

Number of policy constraints = 30
Number of feasible policies  = 4.6 x 10^4
Number of iterations         = 3
Optimal gain                 = 0.1017
Optimal policy type          = constraint-sensitive

State i   Decision     Value v_i
   1      Hit           0.10
   2      Hit           0.42
   3      Steal 3       0.31
   4      Steal 3       0.64
   5      Hit           0.89
   6      Hit           1.33
   7      Hit           1.52
   8      Hit           2.11
   9      Hit          -0.004
  10      Hit           0.21
  11      Steal 3       0.18
  12      Steal 3       0.34
  13      Hit           0.68
  14      Hit           0.98
  15      Hit           1.14
  16      Hit           1.57
  17      Hit          -0.05
  18      Steal 2      -0.03
  19      Hit           0.18
  20      Steal 3       0.10
  21      Hit           0.29
  22      Hit           0.45
  23      Hit           0.55
  24      Hit           0.81
  25      New Inning    0.0

Table 5.9

BASEBALL PROBLEM
(Transient, Thirty Constraints, Started as in Section E)

Number of policy constraints = 30
Number of feasible policies  = 4.6 x 10^4
Number of iterations         = 3
Optimal gain                 = 0.0
Optimal policy type          = constraint-sensitive

State i   Decision     Value v_i
   1      Hit           0.62
   2      Hit           0.93
   3      Steal 3       0.87
   4      Steal 3       1.20
   5      Hit           1.39
   6      Hit           1.83
   7      Hit           2.01
   8      Hit           2.60
   9      Hit           0.35
  10      Hit           0.56
  11      Steal 3       0.57
  12      Steal 3       0.75
  13      Hit           1.01
  14      Hit           1.32
  15      Hit           1.47
  16      Hit           1.89
  17      Hit           0.12
  18      Steal 2       0.17
  19      Hit           0.35
  20      Steal 3       0.31
  21      Hit           0.47
  22      Hit           0.63
  23      Hit           0.72
  24      Hit           0.98
  25      Trapped       0.0
Chapter VI

CONCLUSIONS  AND  SUGGESTIONS  FOR  FUTURE  RESEARCH 

An important limitation of the Markov decision process as a model for practical problems has been overcome. The ability to deal with policy constraints extends the applicability of the model, as we saw in the taxicab and baseball examples. Although the policy constraints there were immediately translated into algebraic form, we also have the ability to express complicated constraints via the algebra of events. We showed, however, that the resultant constraints might not be of the simplest form possible. An area worthy of future research would be to devise a procedure which yields simple algebraic expressions. The algorithm we developed could be used to order the policies according to gain by successively making the optimal policy infeasible. This would involve more computational effort than Nesbitt's [10] procedure. However, it also orders risk-sensitive policies, for which no ordering method has yet been devised. An interesting line of research would be to seek a unified approach to both the ordering and the policy constraint problems.
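The idea of ordering policies by repeatedly excluding the current optimum can be sketched on a two-state problem small enough to enumerate. Everything in the sketch is an assumption made for illustration: the transition rows and rewards are hypothetical numbers in the style of a two-state textbook example, the gain is computed in closed form for a two-state chain, and a brute-force maximum stands in for the constrained algorithm developed in this work.

```python
from itertools import product

# state -> alternative -> (transition row, expected immediate reward); hypothetical data
ALTS = {
    1: {1: ((0.5, 0.5), 6.0), 2: ((0.8, 0.2), 4.0)},
    2: {1: ((0.4, 0.6), -3.0), 2: ((0.7, 0.3), -5.0)},
}


def gain(policy):
    """Gain of a two-state ergodic chain under policy = (alt in state 1, alt in state 2)."""
    (row1, q1), (row2, q2) = ALTS[1][policy[0]], ALTS[2][policy[1]]
    a, b = row1[1], row2[0]          # a = P(1 -> 2), b = P(2 -> 1)
    pi1 = b / (a + b)                # limiting probability of state 1
    return pi1 * q1 + (1.0 - pi1) * q2


def best_feasible(feasible):
    """Stand-in for the constrained optimization: best policy in the feasible set."""
    return max(feasible, key=gain)


def order_by_gain(policies):
    """Rank policies by repeatedly taking the optimum and declaring it infeasible."""
    remaining, ranking = list(policies), []
    while remaining:
        top = best_feasible(remaining)
        ranking.append(top)
        remaining.remove(top)        # the current optimum becomes infeasible
    return ranking


all_policies = list(product((1, 2), repeat=2))
ranking = order_by_gain(all_policies)

# A policy constraint: alternative 2 may not be used in both states.
feasible = [p for p in all_policies if p != (2, 2)]
constrained_best = best_feasible(feasible)
```

For these numbers the unconstrained optimum is the policy (2, 2), and forbidding it yields (2, 1) as the constrained optimum, which is also the second entry of the ranking.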

Another area worthy of further research is sensitivity analysis. As we pointed out, the values of the constraints are interdependent. To discover which constraints, or groups of constraints, are responsible for constraint-sensitivity of the optimal policy, it seems fruitless to try using our algorithm repeatedly in an effort to exhaust all possible constraint combinations. The mere bookkeeping required is mind-boggling. Investigating the structural interaction between the coupled states during maximization of the Lagrangian would probably be a better approach.